The present invention relates to methods of operating reconfigurable
arrays of data processing elements.

When using such arrays, it is desired to optimise the way the
array is coupled to other units, e.g. to a processor if used
as a coprocessor, and/or to optimise the way in which the array
is configured.

The present invention aims at providing improvements over the
prior art.

It is to be noted that the disclosure of the present invention
comprises two major parts in its description that both refer
to ways of allowing for an optimum use of the array and
hence are closely related to each other.

It is also to be noted that the shorter of the two major parts
comprises a plurality of figures that the text relates to,
however without always giving an exact, precise and correct
reference. Yet any deviations from correct referencing will be
obvious to the average skilled person.
1 Executive Summary

The study is concerned with three objectives: 

1. Proposal of a hardware framework, which enables an efficient integration of the PACT XPP 
core into a standard RISC processor architecture. 

2. Proposal of a C compiler for the coupled RISC+XPP hardware. This compiler decides

3. Presentation of a number of case studies demonstrating which results may be achieved by
using the proposed C compiler in cooperation with the proposed hardware framework.

The proposed hardware framework accelerates the XPP core in two resnects First rf«t a «,.«..»», , • 
mcreased by raising the XPFs internal operating frequent ^ rtgfof S SSKSEL? 

S^'i^ SfTE^ , C3C i e ^ Her iS Educed, which manages the cache contend 

^°* e [.P™ b,ern ^rging with a coupled RISC+XPP hardware is concerned with the RISCs 
multitasking concept It becomes necessary to interrupt computations on the XPP fn Jr! 

■be duplication of IRAM cells. ** mStanCe ' much ■**■ *«■> 

The compiler removes the necessity of developing ; Nm£. code for Se Sp hT« H ?^ ^ 5"* 

perform t^ompilJon ^J/f^^ 1? f' Op0Sed C ° mp,ler includes Aree components to 

1. partitioning of the C source code into RISC and XPP parts, 

2. transformations to optimize the code for the XPP and
3. generating NML code.
Finally the generated NML code is placed and routed for the XPP. 

'SLSSS&S^Sa *1 C °"* flBr d r ideS ^ of m application code can be 
£ ^ ? 1 h part5 316 executed ™» * e RISC. Typi ca | candidates for becoming XPP 
code are loops wrth a large number of iterations whose loop bodies are dominated bTSnSc 
operates. The naming source code - including the data transfer code -T«£K3 *>r the 

The proposed compiler transforms the XPP code such that it is optimized for NML code generation
cone so mat it fits tnto the XPP array and that the Final performance exceeds the cure MSr 
2 Hardware

2.1 Design Parameter Changes

2.1.1 Pipelining / Concurrency / Synchronicity

RISC instructions of totally different type (Ld/St, ALU, MuW?iv/MAC FPALU fpm,,i a 

u, mum mwng operations and implementations of simultaneous multithreading (SMT). 
exists of eapUci, * ™ S " Uar0n »« ^ ™» *> «™ necessitate, a, e 

2.1.2 Core frequency/ Memory Hierarchy 

*»* «*q«enoy with fts core «S^,o^ i^TL£rTw£ °" T*°* MCeeds te 
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w£ S^«S5SS d0e -fi. n0t h ' 1P t0 Spe6d UP cora P utationfl wWc h Muffle large amounts of data, 
? ~* COmpUtati0nS 316 ralled " bou n^ ^ memory bridwiduY\ However 

2? ^eVtnsrrTr* more , data l Ta ty (another name *■* ^ p^a^ 

?2 w-S ♦ 7 ^ U u PPer layerS ° f 1118 memor y MwwAy. This is the class of applications 
that gain the highest speedups when a memory hierarchy is introduced. 

^^f^fWri^twn can be used to transform memory- bounded algorithms with a data set too hi* 
SI ^° 5eS mem °7 reusa oa a scale. As the new data set size is c^sTto fit h? 0 ^S 

raemory * e a,sorithm is « ^ 

2.1.3 Software / Multitasking Operating Systems

Memory Hierarchy 

^cilolrt enHanC !% the -fappHcations that can be implemented 

applications are fdtennU* a^ «» * bo - ded ' Typical 

third dimension ofaaTcoW^^oenTuD ^° P rccessin S, where a seconded 

compression algorithmt, STSS Sre^in^l dt^? 15 ^J*?™** ^ P**" and video 
consecutive picWof \ fo? a Tl ° t^ B T'\. to fU,d even searching 

algorithmic compTiity as tll^Jer metST" al8 ° rithms ^ a rou( * highef 

• design or they can transformed St TS^SSV™?:?* data loCaI ' eith ^ by 

higher c.oc k £^^ *• —7 hierarchy and the 

Multi Tasking

-uS^ 

environments With mTtiXw J owrarin^^^T T USUaU) ' ° perated in making 
basis, thus simulating cowu^t »?^ m ^ executed application on a regular 

operating ^temhlsVs^ To switch tasksfthe 

and then reload the state of mJSnSkSH I ? ° f reg,Ste,S -> of *• 

P--"^^^ -e state of the 

^arrra^h^^fH?^"" T -d deeply pipeHned 

W«^^te&STffiiS?r ""f 1 ^ W8h C ° re But high memory 

memory **<m£^^^^*°* * *• betweS core »d 

Deep pipelines incur pipeline stalls du^to 1^ 2™^ availab,e memory. 

mispredicted conditional branch^ SpecLLf LSf ,? W « " " branch P 6 " 31 ^ ^ 
integer^yprograms.Formesereasota^ «* *r 
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^P™ aSS^ ^L^T»r T, S M S Iti ~ S (SMI), adds hardwa. 
than one independSSo^S to belted "En * w£ * ^ ^ 

or doesn't utilize all functional imST 1 " - ' wtie ? iever one instruction stream stalls 

utilization for today's p^elore. ' °*/ °™ 0an JUmp ,n " ™» im P*™* Actional unit 

the processor has t0 ba 

combination of the PACT XP? a* a^taXd S ™J ? P Sj?* 83 T dI 35 P 035 * 1 * For 
configurations execute longer than thTave^e^sr ■ '• MT J? Veiy benefic H since the XPP. 
other^ctional units, while a cXIJ^J^.T^i^ ,,, ^ h * <*« ^ 
the XPP, so while one such non-XPP task is^Tuiiu^^ anonW OM^lTbe'able'to u^Ae^Pp , rore rt '' ,Ze 

2.2 Communication Between the RISC Core and the 
XPP Core. 

2.2.1 Streaming 

outputs. As the pipelines takeaTon^ £ to fi n JSP"? "*?*»■*»»» ** feature few Liputs and 
limited (as will ^described «£SSS52W^5" """^ *~ ° f 3 «^JL^t 
to b.gger arrays and array fiequencies nS^SOco^^ ^ X " UCaAoa ^ SCa,e 

■ Streaming from the RISC core 

if the XPP core is reading CS.T££ SL ^"2^ 

■ Streaming via DMA 

2.2.2 Shared Memory (Main Memory)

approach suffers from Hie same limStions S C ^ nUn,ber of 10 P orte « v «y limited th is 

of using PAEs for address Z^TL^^Z^T^ 1°*- ^ 

values from very sparse arrays. S " However * IS approach is still useful for loading 

2.2.3 Shared Memory (IRAM)

There are several ways to fill the IRAMs with data.
■ The IRAMs can be loaded in advance by separate "load" configurations using streaming.

^^S^Tm ^ ^ V regiStCIS - AS ^e, this will limit the 

performance of the XPP array, especially as the IRAMs will always be part of the externally
visible state and hence must be saved and restored on context switches.
■ The IRAMs can be loaded in advance by separate "load" instructions.

ISmH^ tT^tZ^ ^^ C ° nfisUratiDn configuration reloads. 

Th£ ' ^ JT E ^ u,stmct,ons mav use a wider interface to the memory hierarchy 

This corresponds to the usage as vector registers.

" co'ntrSef" ^ * ^ by * ' W P ^ *° m * the cache 

' 2*?* m ° dB hOWBVCT 18 8 combination " f * e Prions solutions with the extension of a 

^AMx^^r 10 " T, a j^ mc memor y defined by starting address and size to 

A^S". re£5 l c 8gen3 a (d !5 ye u d * low priority > burst load ^ memory hierarehy (cache? 
After all IRAMs are mapped, the next configuration can be activated. TteartivS Sun i 

^uTlf^ 5? ^ ^" Pleted - H * wever ' »f *• P^oad tostrStonTa^issu^g 
Se^yS" D ° ° r ** cacba Reality, the wait w?l| no? 

2.2.4 Proposal 

within fee RISC jHESSI a ~ °? ^ * *" nwmo, y W«™*y existing 

PAE amy toeauSreTi sn™TS^ ^Ircitly preloaded configured™ cache within to 
configXn?. emf " oyed <° »W« pretoadaig of configurefions end fast switching between 

IttStoiS&ZSS&ltt^jr*^'*^ nnmbeeof soch 

Instructions 
XPPPreloadConfig (void *ConfigurationStartAddress)

Se 0 ^^ 011 b " Prel0ad ^ >° be l0aded into juration cache within the 
^viollr SpeCU,ative P reloads P«*^ «nce successive preload commands overbite the 
P °' nter re8iSter ° f WSC P ° inter rcgist€r f,Ie - size is ™P^y stained i„ 

XPPPreload (int IRAM, void *StartAddress, int Size)
XPPPreloadClean (int IRAM, void *StartAddress, int Size)

The first parameter is the IRAM number. This is an immediate (constant) value.
^RlSS^ZiSSZS.'" — • ™* I— is provM* h . ^ 

mSlliSSSS^ ™ S fa " » — ••» • P-**V of fte 

The first variant actually preloads the data from memory 

XPPExecute ()

XPPSync (void *StartAddress, int Size)

^P^iS^Se^ uTL^r ? ' overfa P S-n memory area. If 

•his inaction oan a.so be used to ^^i^^SiK^ ^ *" *** mem0 *>> 
Figure 1: Memory interface

tor (invito; i <i oop; p+y ( ' 
XPPPreloadConfig(CONFIG1);
XPPPreload(IRAM2, 0x20000, 30);
XPPPreload(IRAM0, 0x20400, 200);
XPpPB=io««lcic«n< IRAM5, O.i200oo 10 V 

/*
   other RISC computations
   In the meanwhile the burst preloads and
   the configuration preload are running
*/

XPPExecute(CONFIG1);
/*
   Other RISC computations
   maybe burst preloads of
   another configuration and other data
*/

) 
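The call pattern of the example above can be mimicked in plain C to make the intended ordering explicit. The following is only a software model under stated assumptions: in the proposal these are processor instructions, so the function bodies, the `op_log` array, the `run_once` helper and the reading of `XPPSync` as "force a write back before the RISC reads the area" are illustrative and not taken from the text.

```c
#include <stddef.h>

/* Hypothetical software model: each "instruction" only records that it was
   issued. Real hardware would start asynchronous burst preloads instead. */
enum op { OP_PRELOAD_CONFIG, OP_PRELOAD, OP_PRELOAD_CLEAN, OP_EXECUTE, OP_SYNC };

#define LOG_MAX 64
static enum op op_log[LOG_MAX];
static size_t op_count = 0;

static void record(enum op o)
{
    if (op_count < LOG_MAX)
        op_log[op_count++] = o;
}

void XPPPreloadConfig(void *cfg)                  { (void)cfg; record(OP_PRELOAD_CONFIG); }
void XPPPreload(int iram, void *start, int size)  { (void)iram; (void)start; (void)size; record(OP_PRELOAD); }
void XPPPreloadClean(int iram, void *start, int size)
                                                  { (void)iram; (void)start; (void)size; record(OP_PRELOAD_CLEAN); }
void XPPExecute(void)                             { record(OP_EXECUTE); }
void XPPSync(void *start, int size)               { (void)start; (void)size; record(OP_SYNC); }

/* One configuration run: preloads are issued as early as possible, the RISC
   may do other work in between, and the result area is synchronized before
   the RISC touches it (assumed meaning of XPPSync). */
void run_once(void *config, int *in, int n_in, int *out, int n_out)
{
    XPPPreloadConfig(config);
    XPPPreload(0, in, n_in);         /* input area: data is read from memory   */
    XPPPreloadClean(1, out, n_out);  /* output area: mapped without a load     */
    /* ... other RISC computations could run here ... */
    XPPExecute();
    XPPSync(out, n_out);
}
```

Keeping the preloads well ahead of `XPPExecute` is what gives the burst transfers time to complete in the background.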

Note: ta an ptaeo, where «>r S tan« are u«d 





vutnal processor in an SMT environment TtehftT ^ .^1 O"" «° >» duplicated for every 
address, siM „„«, satt < , .^X, 7 telS S"^ 3* *T ,* "* e >>°*°™*>l<>e ^tS 
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co^Sf ^ iS3Ued m IRAM ' &e PreVi0US Ulfonnation is retai ^ ^mating identical preload 




-.ram 



Figure 3: Asynchronous pipeline of the XPP 

«n?LT2^ (sta,Ied) ^ aU necessat * ^ s - 

controller. Hence the cacCconn^^ OT by ° ache 

commands as well actually stXe meWl^n * ° ^f? s 5* chron ^on and execute 
of the configuration! d^S ^SSn™ 35 ^ ^ "' 9 ^ ^ the termination 
not reused in the same Sp7aE .I^h ^ ^ i°° n 33 P ° SsibIe > tf * eir ■«« * 

as a single unit since they^ J6 nothave dtf^^^ ^ Cache "'^^ ca * therefore be seen 
seen aAe ^^^^^ 0^^^^- *J **** cofllro " er «• 
of th. XPPpipelife, a, S o tn^t^^^ ^ te b ** W 

FIFO (-pipeline) for loos fSuo^sT?f - USmg confi S u ^°" and data preload 

alreadyZea p Joad^Ttne « - 

speculative; the amount of speculation is the comnn^iIS £ « Preloaded These preloads can be 
be several configurations- it is limiKdild^S ller ^ ade -°f ™ e l^ng* of the preload FIFO can 
ability to schedule Sa<£ ^X^JS^S^"^ Pr0p6rtieS co ™Pi^ 

done optimally by sSES^ ^ ? l Y °P eratl °"' interlocking cannot be 

Hencerte X?F ^cac^c^Sef atdIL i^PA^ blocking), 
independent funcSn J S3 P W ^ can be seen 33 ***** but not tetany 






02-JUL-2003 16:09 




+49 721 46930a s ^ lg 



Figure 4: State transition diagram for the XPP cache controller
(Labels recoverable from the diagram: configuration, preload, execute, write back, wait, finished, preload needed urgently, all write backs blocked by in-use IRAMs, all preloads blocked by dirty or in-use IRAMs, no clean IRAM instance, no empty IRAM instance, discard LRU clean IRAM.)

^L^^Z^X^r t PiCted ? ~" i0 State 
soon as the condition Snt »y more^S 3^^^? ^ ed ^- As 
states are as follows: te tranait w n place. The activities for the 

^ fi t*d„^ foKiU *-* -edp re ,oad commands, while 

£ssx*r S;ssrr? on ^ be — - * • — - gent t** 
i ^ saw ^^t7:^r^%r tz't->** — «■* 

waiting for the configuration to finish ™d ^nr L f !* 1116 former can be resolved by 

used c/ e «„ IRAM cI^^lT^L^l^ T ° ^ *. least recently 

a dirty one has to be writtej ^2? ^^S^SSS^ "J** 0 '* ^ KAM instaace s*** 5 . 
*V IRAM instances exist, siD^^S^^^ 5 ^ nn ^ t t 0CCUr ** no e W> cfe OT or 
instance in an IRAM block - otftf ~ "* ' cnpnes ^ an tero "« u " -~ * k ~- ' J 1 " 



&and there should be more' than one 
ed. 
2.2.5 Further Improvements

Write Pointer 

beyond this write pointer is encounte^TOrbli^^^T-r^ 13 required ' un,e « an access 
after a task switch: V de[ ay I?S?Sb«^ V ^ IRAMs haVe 10 * 

engine of the cache controller SLoJSttTStodS m^M™* Preload 
reloading 3 me Wockl «g IRAM next whenever several IRAMs need 

Longer FIFOs 

« mB ^> — « * «— - the same 

PACT XPP core, the prefetch fS^^-EEE^ c0 °«?« 9r b8tW6en fte WSC c <*° ml™ 
for several conf.gunXL can b^i^S, ^t£JSSSf" ^ 1,16 1RAM 

mafces clear which ISAM A simple convention 

to the next configuration context This can be ISmSSSS 2 the configuration execute switches 
every configuration execute, while leaving ?i52?2 * tbe ^ ° write pointer with 

entries ke ep their eontents ^ ^ Z^offoZZ*" 10 ^™ 5 ^" 1 IRAM lio 

the preceding configuration', IRAtVfa if no dSn^ 0 ^^^^ ^figuration will use 

If none of the memory areas to be conied midam ■ • 

clear the situation. Note however Z ^ZS^o^LV^ A ^ «*»*»>• P«««=ol =2 
any new IRAM area nod . cimemTdirrv Svr ^15" T""" more 'fan overt™ betwe™ 

Data Distribution 

oorapowion. So it fa be*. if m^a^S^' r"» ta " <* JABs available fo? the 

top of the memory hierarchy fa the source of fee daTJiT: «o «move thfa hoffleneck. The 

*- the access pattern ia changed, as ^Z^^Z^gg**- — r focuses 




Many algorithms access memory in linear order by definition to utilize block readin- and simnle 
address calculous. In most other cases and in the cases where loop tiline ^neededTa SJSE 

In many of the remaining cases, the compiler can modify the access
pattern by data layout rearrangements, so that finally the data is accessed in the desired pattern.
of these optau«tai can be used because of dependencies, or becaSe^Tayow Ze 
are still two possibilities to improve performance: X 

Data Duplication: 

o «^™n«M)i^^tos» 1 oiioatoM n , aWp i e | Rj ^ conmmM1) ,. 

The interface of this instruction looks like: 
XPPPreloadMultiple (int IRAMS, void *StartAddress, int Size)

«+!E$E^ " ** ' XPPPrel - dClea » the option 

The first parameter is IRAMS. This is an immediate (constant) value. The value is a bitmap - for
every bit in the bitmap, the IRAM with that number is a target for the load operation.

There is no "clean" version, since data duplication is applicable for read data only. 
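The bitmap semantics of the multiple preload can be sketched as a small software model. The IRAM count, the IRAM depth in words, the `iram` array and the name `model_preload_multiple` are assumptions made for illustration; this section does not specify them, and the real operation is a hardware burst load rather than a `memcpy`.

```c
#include <string.h>

#define NUM_IRAMS  16   /* assumption: number of IRAMs is not given here   */
#define IRAM_WORDS 128  /* assumption: IRAM depth in words is not given    */

static int iram[NUM_IRAMS][IRAM_WORDS];

/* Copy `size` words from `start` into every IRAM whose bit is set in the
   bitmap `irams`. Returns the number of IRAMs that received the data. */
int model_preload_multiple(unsigned irams, const int *start, int size)
{
    int loaded = 0;

    if (size < 0 || size > IRAM_WORDS)
        return 0;

    for (int n = 0; n < NUM_IRAMS; ++n) {
        if (irams & (1u << n)) {
            memcpy(iram[n], start, (size_t)size * sizeof(int));
            ++loaded;
        }
    }
    return loaded;
}
```

A bitmap of `0x5`, for example, selects IRAM 0 and IRAM 2, so the same read-only data becomes available to two consumers after a single pass over memory.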
Data Reordering: 

Data reordering changes the access pattern to the data only. It does not change the amount of memory
that is read. Thus the number of cache misses stays the same.

o Adding additional functionality to the hardware:

o Adding a vector stride to the preload instruction. 

tJZSL*? s f ,a f ment dements in memory) is used in vector load 

operations to load e.g.: a column of a matrix into a vector register. 

This is the most regular non-linear access pattern. It can be implemented in hardware
by giving a stride to the preload instruction and adding the stride to the IRAM

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ISES^ Stat «2? f 10 ^ thi3 tosfruction « that tfa» number of possible 

Z*l r s ?1 IR ^ oad nses: 111 1116 Won5t case ft be «• «*• «« P» 

loaded value if the stride is equal to the cache line size and all data is not in the cache
changes. Still this is an undesirable effect. 

Sn„°^ ? 7 r0bl ?i B1 ** C0 ?'P lexit y of 49 implementation and a possibly limited 

ssff-'yESfit* TransferTins non - contiguous ^ « «s 

The interface of the instruction looks like:

XPPPreloadStride (int IRAM, void *StartAddress, int Size, int Stride)
XPPPreloadCleanStride (int IRAM, void *StartAddress, int Size, int Stride)
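The effect of a strided preload can be illustrated with a software gather. This is a sketch under assumptions: the stride is taken to be measured in elements, the destination buffer stands in for an IRAM, and `model_preload_stride` is a hypothetical name, not the proposed instruction itself.

```c
/* Gather `size` elements that lie `stride` elements apart, starting at
   `start`, into the contiguous buffer `iram`. */
void model_preload_stride(int *iram, const int *start, int size, int stride)
{
    for (int i = 0; i < size; ++i)
        iram[i] = start[(long)i * stride];
}
```

For a row-major matrix with `ncols` columns stored in a flat array `m`, calling `model_preload_stride(buf, m + col, nrows, ncols)` collects column `col`, which is the matrix-column case mentioned for vector loads.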

^StpS 9 * mW ~ l / ^-dClean instructions with the 

o Reordering the data at run time, introducing temporary copies, 
o On the RISC: 

memory is of no concern ^ HenCe * e wite back °V^ion to 

o Via an XPP configuration: 

2.3 State of the XPP Core 
The state of the XPP core can be classified as

1 Read only (instruction data)

■ configuration data, consisting of PAE configuration and routing configuration data

2 Read-Write

■ the contents of the data registers and latches of the PAEs, which are driven onto the busses
■ the contents of the IRAM elements

2.3.1 Limiting Memory Traffic

There are several possibilities to limit the amount of memory traffic during context switches.

Do not save read-only data 
Save less data 

ana J&itS^St'" 1 P-TPtive), an of the local slats on the bosses 

Save modified data only 

^ssr__s_ar_c 

Use caching to reduce the memory traffic 

reduce the memory traffic ^^n^ !11 P - ^ OTS dn ™ g the *-* This cache can also 

-^~v£t_-s_sr aRsasasi-* »— * U5 * <«» 

X?__r__?3 ____£_,■*£ ^r^T" 01 * -_«_ 

.epiicated as for SS*T_S2 Son^5.^^ tt °!!^ he ™™ ,M " 2 ' cs,,s * t » lkl "■«> be 

ZZ^2ZZX£ttSSZ&££i« - -* «— — * - 

~ bS__5_t5" ' <U0mOdified) *— » Co* h e„« 
2.4 Context Switches

Usually a processor is viewed as executing a single stream of instructions. But today's multitasking
operating systems support hundreds of tasks being executed on a single processor. This is

ST* "* - *" "* S v**** - 15 ""ta^ "** the state of another task, that will be executed 

%Z*Z^^£ l^^^nT^ Qf ^ P roC6S50rs ^ ^ultaneous 
(ISR)andTSawS HypeiThreaduig), execution of an Intenupt Service Routine 

2.4.1 SMT Virtual Processor Switch 

totaUy in harfwar*. Actions 

parallelism and improve fwS St „ f uistruction stream to increase instruction level 
reloaded from memory bSSSiSSS^S^ ?»e processor state cannot be,stored to and 
of alternating ^cZ^^^^^^u ^ 'a^T 011 Streams: ^rst case 

Processor sfcfte to memo" SSlES^ * * ° f Cycles * ^ the 

T? * -ry virtual P ™ Every 
counter was used to fetch the ^^on^vZ^V ^ ° f VUtUal processor « whose Program 
be inserted to select one ofte^S^ 

Sttl^dtSSi*' aiUCOn « * so the si Z e of the 

SietentS/S^ fte * ate * thus enabling efficient 

2.4.2 Interrupt Service Routine 

thclSR. ^srore me state of all other resources, that are actually used within 
The more state information to be saved, the slower the 
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2.4.3 Task Switch 

This type of context switch is executed totally in software. All of a task's context (state) has to be
aVXT* ^^-^^^ »-v task has to be Xded 2^\3£££2n allowed 

2.5 Software / Hardware Interface 

to.the hardware, the 

handling will be discussed: • g ln * e follow *g * e most promment changes and their 

2.5.1 Explicit Cache 

The proposed cache is not a usual cache, which is - not considering performance issues -
invisible to the programmer / compiler. The proposed cache is an
explicit cache. Its state has to be maintained by software.

Cache Consistency & Pipelining of Preload / Configuration / Write back 

Only this DUMwiH T£2J3^ 'EtSZ^f** * I"* 6 * * 

IRAMs is wrirten, it is not 52^*1^.1 * memory. If however more than one of the 
deterministic behavior) * * ^ Wdl be W,tten to memo ^ This is a software bug (non 

controller and the XPP array, *n SSS^LIS^S- C ° nta,n ** hazards - As 

these data hazards are e3S to SETS « fcwbanal units, which are effectively pipelined, 

°*marypipelinW 

• Hardware interlocking: 

S5?MM5^^ C3Che ~' ^ *« *° -g of a 

to stall that preload, enS^etlTz^ nf*? ^ ^ Qad > fc *** 

preload. •«u»*ing me execution of the current configuration and the • 

• Software interlocking; 

modular alias- and data- de^nde^ a\i£S ^"JS?? ' In f" and fotcr- 

fcheduling algorithms help to S^iaS^e i^J oTZ * *" 15 while 
instructions. unpaci of the necessary synchronization 
In either case, as well as in the case of pipeline stalls due to cache misses, SMT can use the
computation power that would be wasted otherwise.

Code Generation for the Explicit Cache

Apart from the explicit synchronization instructions issued with software interlocking, the following
instructions have to be issued by the compiler.

" S"lh S , Urati °« Prel ° 3d in f™ ctions ' Preceding the IRAM preload insthirtions, that will be used 
' S^sladuir" 0 " 5 ' t0 °' *°" H bC SChedu,ed - "* - Po-ble by the 

Asynchronicity to Other Functional Units 

A configuration wait instruction followed hv an ,„c«^,,^»: r - . 

by the compile, if an instruct «— 
memory area, that is potentially dirty or in-use in an IRAM tSvJt L St ? a,t) ^ access a 
instruction streams and the cache contents rfZ^" ™ xs A forces a synchronization of the 

inter-modular array ^J^^SS^TS^^iT^ A ?°™ Ugh inter "P roced ^ ™« 
acceptable level. frequency of these synchronization instructions to an 
3 Program Optimizations

3.1 Code Analysis 

Several analyses can be performed on programs; these analyses are then

used by different optimizations. They describe the relationships between data and memory locations in
the program. More details can be found in several books [2^,5J. . memory locations m 

3.1.1 Data-Flow Analysis 

$$Ex[i] = Prod[i] \cup (In[i] - Supp[i])$$
It means that data available at the end of the execution of object i, Ex[i], are either produced by i

These equations can be used to solve several problems like:

■ the problem of reaching definitions, 

• the available expressions at a point in the program, 

■ the live variables at a point in the program, 

whose solutions are then used by several compilation phases, analysis, or optimizations.
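A data-flow equation of this kind is commonly written as Ex[i] = Prod[i] ∪ (In[i] − Supp[i]) and maps directly onto bit-set operations. The sketch below assumes that each set fits into one machine word (one bit per definition, expression or variable) and that the objects form a simple chain where the output of one object is the input of the next; the names `set_t`, `df_out` and `df_chain` are illustrative only.

```c
typedef unsigned set_t;  /* one bit per definition / expression / variable */

/* Ex[i] = Prod[i] U (In[i] - Supp[i]):
   what is produced by i, plus what came in and was not suppressed by i. */
set_t df_out(set_t in, set_t prod, set_t supp)
{
    return prod | (in & ~supp);
}

/* Propagate through a straight-line chain of n objects, where the set
   available at the end of object k is the set entering object k+1. */
set_t df_chain(const set_t *prod, const set_t *supp, int n, set_t entry)
{
    set_t cur = entry;
    for (int i = 0; i < n; ++i)
        cur = df_out(cur, prod[i], supp[i]);
    return cur;
}
```

For a general control-flow graph the same transfer function is applied repeatedly until the sets stop changing; the chain version only shows the single-path case.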

As an example let us take the problem of computing the Def-Use chains of the variables of a program.

all vH2T.^SZ?i* PI'V! 6 ch £ n 13 associated to each definition of a variable and is the set of 
blocks tJ d^T" Xtil ?i^ n - **• nations presemed above are appISn to ine bast 
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Figure 6: Control-flow graph of a piece of program 



3.1.2 Data Dependence Analysis 

A data dependence graph represents the dependences existing between operations writing or reading 
the same data. This graph is used for optimizations like scheduling, or certain loop optimizations to 
test their semantic validity. The nodes of the graph represent the instructions, and the edges represent 
the data dependences. These dependences can be of three types: true (or flow) dependence when a
variable is written before being read, anti-dependence when a variable is read before being written,
and output dependence when a variable is written twice. Here is a more formal definition [3].

Definition 

Let S and S' be 2 statements, then S' depends on S, noted $S \, \delta \, S'$ iff:

(1) S is executed before S'

(2) $\exists v \in VAR : v \in DEF(S) \cap USE(S') \;\lor\; v \in USE(S) \cap DEF(S') \;\lor\; v \in DEF(S) \cap DEF(S')$

(3) There is no statement T such that S is executed before T and T is executed before S', and
$v \in DEF(T)$

Where VAR is the set of the variables of the program, DEF(S) is the set of the variables defined by
instruction S, and USE(S) is the set of variables used by instruction S.
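Condition (2) of the definition can be checked mechanically when DEF and USE sets are encoded as bit masks. The sketch below is a simplification: it assumes S is already known to execute before S', it does not model condition (3) (no intervening redefinition), and the names `classify_dependence` and `DEP_*` are hypothetical.

```c
enum { DEP_NONE = 0, DEP_TRUE = 1, DEP_ANTI = 2, DEP_OUTPUT = 4 };

/* Sets are bit masks with one bit per variable. S executes before S2.
   Returns a combination of DEP_* flags for the dependences of S2 on S. */
int classify_dependence(unsigned def_s, unsigned use_s,
                        unsigned def_s2, unsigned use_s2)
{
    int kinds = DEP_NONE;

    if (def_s & use_s2)  kinds |= DEP_TRUE;    /* written by S, read by S2     */
    if (use_s & def_s2)  kinds |= DEP_ANTI;    /* read by S, written by S2     */
    if (def_s & def_s2)  kinds |= DEP_OUTPUT;  /* written by both statements   */

    return kinds;
}
```

With one bit each for a, b and c, the pair `a = b + 1; c = a + 2;` yields a true dependence, `a = b + 1; b = c + 2;` an anti-dependence, and `a = b + 1; a = c + 2;` an output dependence.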

Moreover if the statements are in a loop, a dependence can be loop-independent or loop-carried. This 
notion introduces the definition of the distance of a dependence. When a dependence is loop- 
independent it means that it occurs between two instances of different statements in the same iteration, 
and then its distance is equal to 0. On the contrary when a dependence occurs between two instances in 
two different iterations the dependence is loop-carried, and the distance is equal to the difference 
between the iteration numbers of the two instances. 
The direction of a dependence generalizes the notion of distance, and is generally used when
the distance of a dependence is not constant, or cannot be computed with precision. The direction of
a dependence is given by < if the dependence between S and S' occurs when the instance of S is in an
iteration before the iteration of the instance of S', = if the two instances are in the same iteration, and >
if the instance of S is an iteration after the iteration of the instance of S'.
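When the distance is known, the direction follows from its sign. The helper below assumes the convention that the distance is the iteration number of the instance of S' minus that of the instance of S; the function name is illustrative.

```c
/* Map a dependence distance to its direction symbol.
   distance = iteration(S') - iteration(S). */
char direction_of(long distance)
{
    if (distance > 0)
        return '<';   /* S runs in an earlier iteration than S' */
    if (distance < 0)
        return '>';   /* S runs in a later iteration than S'    */
    return '=';       /* both instances in the same iteration   */
}
```

A distance vector such as (0,2) therefore corresponds to the direction vector (=,<).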

In the case of a loop nest, we then have distance and direction vectors, with one element for each level
EScS ^ J*»» « lus ^ -B definitions. The data depeSe £2 

loop can be vectorized if its data dependence graph does not contain any cycle. 



for (i=0; i<N; i=i+1) {
  S:  a[i] = b[i] + 1;
  S1: c[i] = a[i] + 2;
}

Figure 7: Example of a true dependence with distance 0 on array a



for (i=0; i<N; i=i+1) {
  S:  a[i] = b[i] + 1;
  S1: b[i] = c[i] + 2;
}

Figure 8: Example of an anti-dependence with distance 0 on array b

for (i=0; i<N; i=i+1) {
  S:  a[i] = b[i] + 1;
  S1: a[i] = c[i] + 2;
}

Figure 9: Example of an output dependence with distance 0 on array a
for (j=0; j<=N; j++)
  for (i=0; i<=N; i++)
  {
    S1: c[i][j] = 0;
    for (k=0; k<=N; k++)
      S2: c[i][j] = c[i][j] + a[i][k]*b[k][j];
  }

Figure 10: Example of a dependence with direction vector (=,=)
between S1 and S2 and a dependence with direction vector (=,=,<)
between S2 and S2.



for (i=0; i<=N; i++)
  for (j=0; j<=N; j++)
    S: a[i][j] = a[i][j+2] + b[i];

Figure 11: Example of an anti-dependence with distance vector (0,2).

3.1.3 Interprocedural Alias Analysis

The aim of alias analysis is to determine if a memory location is aliased by several objects, like
variables or arrays, in a program. It has a strong impact on data dependence analysis and on the
application of code optimizations. Aliases can occur with statically allocated data, like unions in C
where all fields refer to the same memory area, or with dynamically allocated data, which are the usual
targets of the analysis. In Figure 12, we have a typical case of aliasing where p alias b.

int b[100], *p;

for (p=b; p < &b[100]; p++)
  *p = 0;

Figure 12: Example for typical aliasing

Alias analysis can be more or less precise depending on whether or not it takes the control-flow into
account. When it does, it is called flow-sensitive, and when it does not, it is called flow-insensitive.
Flow-sensitive alias analysis is able to detect in which blocks along a path two objects are aliased. As
it is more precise, it is more complicated and more expensive to compute. Usually flow-insensitive
alias information is sufficient. This aspect is illustrated in Figure 13 where a flow-insensitive analysis
would find that p alias b, but where a flow-sensitive analysis would be able to find that p alias b only
in block B2.

Furthermore aliases are classified into must-aliases and may-aliases. For instance, if we consider
flow-insensitive may-alias information, then x alias y, iff x and y may, possibly at different times, refer to
the same memory location. And if we consider flow-insensitive must-alias information, x alias y, iff x
and y must, throughout the execution of a procedure, refer to the same storage location. In the case of
Figure 13, if we consider flow-insensitive may-alias information, p alias b holds, whereas if we
consider flow-insensitive must-alias information, p alias b does not hold. The kind of information to
use depends on the problem to solve. For instance, if we want to remove redundant expressions or
statements, must-aliases must be used, whereas if we want to build a data dependence graph, may-aliases
are necessary.



Figure 13: Example of control-flow sensitivity
(Control-flow graph with blocks labelled B1 to B4; recoverable node contents: the declaration int *p, b[100]; assignments to p, including p = malloc(); and uses of b and p.)

MSSSS^ — variables 

fraction call where * is passed twice as parSSr * ' and y are abased through the 

void foo(int *i, int *j)
{
  *i = *j+1;
}

foo(&k, &k);



Figure 14: Example for aliasing by parameter passing 
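The effect of aliasing through parameter passing can be observed directly. The following sketch reuses the function from the figure and adds two small wrappers, `alias_demo` and `no_alias_demo`, which are illustrative names that do not appear in the original.

```c
/* Same function as in the figure: i and j may point to the same object. */
void foo(int *i, int *j)
{
    *i = *j + 1;
}

/* Both parameters refer to k, so the store through i changes what j sees. */
int alias_demo(void)
{
    int k = 5;
    foo(&k, &k);
    return k;
}

/* Distinct objects: the store through i leaves *j untouched. */
int no_alias_demo(void)
{
    int x = 0, y = 5;
    foo(&x, &y);
    return x + y;
}
```

Because a compiler analysing `foo` in isolation cannot tell the two call patterns apart, it must assume that `*i` and `*j` may overlap unless interprocedural alias analysis proves otherwise.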

3.1.4 Interprocedural Value Range Analysis

This analysis can find the range of values taken bv the variable* Tt k-u. . i_ . . 

dead code elimination, loop uMolline and uZrl JZEr h Ip to ^ piy °P t,rai2aI wns like 

of variables and then 55^«2LS!^Tl"- ? tb ' s J' ur P os * lt «■» «se information on the types 

This analysis has to be interprocedural as, for instance, loop bounds can be passed as parameters of a
function, like in the following example.
void foo(int *c, int N)
{
  int i;

  for (i=0; i<N; i++)
    c[i] = g(i,2);
}

if (N > 10)
  foo(a,N);
else
  foo(b,N);



known assert function. 8 ^ ^mantles. This can be done by pragmas or a compiler 

3.1.5 Alignment Analysis

can be viewed as a muW-dimeSSafarr^v fflS a iST ^ I-*" 3 * itS rea,iza « 0 n * hardware 
identify memory Nation* tha/Swayl S ^^Tf^^' ^ 
dimension of interest is memory banks, taifJ^SS^uiS^7 mm For .example, if the 
accesses the same bank". This is the case nft^l2 J^i ? rf * memor y reference- always 
[10], where all a^es/depfctedt Sue « to £ £ *"* «"» be f ° wd *■ 

the accesses are not aligned He ^ tf^Sf?«5?i ™ ^ *» the P*rt, 

compiler-controlled me^ryoptimSon^^ mfonnation ia useful in a variety Vf 

and energy consumption^ opt,m,zat,ons Ie * d "*g * improvements in programmability, perform2.ee. 
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Alignment analysis, for instance, is able to heln find » h.-^v 

furthermore useful for automatic data distribution tools JTlSSSS? " d * 

able to automatically generate alimmenr r 2" matlc alignment analysis tool can be 

sunplifi^mec^dSSpToC bfe^d^t b a P rocedure * us 

into account dynamic realignment WIth 30 '"procedural analysis taking 
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— 3l S be USed to apply Io °P aiwmwnt that transforms the code directly rather 
than the data layout m rtself, as shown later. Another solution can be used for the PACT Sp 

the aliVnnient «,.]„„•„ • ,? Arr ' ™ e rest 13 executed by the host processor If 

be inlSS * 1BlySlS 18 ^ f nter-piocedural or inter-modular) less conditionafcooX' to 



3.2 Code Optimizations 

Most of the optimizations and transformations presented here can be found in detail in [4], and also in

3.2.1 General Transformations

^Sdt - ?^ SaS2S£ ,ta " be appUed to ^forward code, and 

sequel of this docum^ ^ P * appear m * ""V"* but *V ™ mentioned m the 

Constant Propagation 

N = 256;                            for (i=0; i<=256; i++)
c = 3;                                a[i] = b[i] + 3;
for (i=0; i<=N; i++)
  a[i] = b[i] + c;

Figure 15: Example of constant propagation 

Copy Propagation 

^S^^SS^^ moving redundant copies of the same variable in the code 
--tUiste^^^ 



i*d - 



Figure 16: Example of copy propagation

Dead Code Elimination 

SVnX^^ ~ C^e is never executed if i, „ in 

loop body, whose number SS^Sl^ 6Valuated to *™ e « or if it is a 






Code updating variables, that are never used, is also useless ™H M « 
soever used, then the code updating it m***SZSS£^ 

a[j] = b ] I a[i]; ^C=O;j<10;j ++ , 
for(j=.0;j<l 0 ; j++ ) L J } «t3+U - a[jj + 6 IjJ; 



j a W+« - a[j] + b[j].- 

Figure 17: Example of dead code elimination



Forward Substitution 



c = N + 1;
a[c] = b[c] + a[i];                 a[N+1] = b[N+1] + a[i];

Figure 18: Example of forward substitution

Idiom Recognition 

pis transformation recognizes nieces nf , n w„ . 

forfi=0; i<N ; { ^ 

c - «UJ - Mi] , for (i-O; i< N ; 

if (c<0) c ~ «£iJ ~ Mi]; 

d[i] . C ;' , - c; 

Figure 19: Example of idiom recognition

3.2.2 Loop Transformations 

Loop Normalization 

bounds of the loops are S £!fa? Id h / ^«<»™ and the 

and ease inter-loop dependence analysk „d ft 3£ ^ IO ° P 6,31011 to fmd opportunities, 

nonnahzed loop to be applied. ^ aiyS,S ' 3X1(1 rt aIso enab ' e * the use of dependence tests that needs 

for (i=2; i<N; i=i+2)               for (i=0; i<(N-2)/2; i++)
  a[i] = b[i];                        a[2*i+2] = b[2*i+2];

Figure 20: Example of loop normalization 



Loop Reversal

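This transformation changes the direction in which the iteration space of a loop is scanned. It is usually used in conjunction with other iteration space transformations, like loop interchange, because it changes the dependence vectors.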

1 J b[1]; . a[i] = bCi]; 

Figure 21: Example of loop reversal 

Strength Reduction 

^^^^^^^^^^^^^ 

for(i=0; i<N; i++)                    t = 0;
    a[i] = b[i] + c*i;                for(i=0; i<N; i++) {
                                          a[i] = b[i] + t;
                                          t = t + c;
                                      }

Figure 22: Example of strength reduction 
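
The effect of strength reduction can be checked with a small C sketch. The function names below are hypothetical and chosen only for illustration; the sketch assumes the multiplication being replaced is c*i, so the running value t starts at 0.

```c
#include <stddef.h>

/* Direct form: one multiplication per iteration. */
void add_scaled_index(int *a, const int *b, int c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c * (int)i;
}

/* Strength-reduced form: the product c*i is carried in t and
   updated with an addition instead of being recomputed. */
void add_scaled_index_sr(int *a, const int *b, int c, size_t n) {
    int t = 0;                      /* equals c*i at the top of iteration i */
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + t;
        t = t + c;
    }
}
```

Both functions write the same values into a; only the cost of the loop body differs.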

Induction Variable Elimination 

dependence cycles due to the update of the *^j££*£££Z ^ 1,1,9 a,so rem °ves 

for(i=0; i<=N; i++) {                 for(i=0; i<=N; i++) {
    k = k + 3;                            a[i] = b[i] + a[k+(i+1)*3];
    a[i] = b[i] + a[k];               }
}                                     k = k + (N+1)*3;

Figure 23: Example of induction variable elimination
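
As a hedged illustration of induction variable elimination, the following C sketch compares a loop that updates k in every iteration with a version where k is expressed through the loop index. The function names are hypothetical, and the placement of the update of k before its use is an assumption derived from the closed form a[k+(i+1)*3].

```c
/* Original form: k is an induction variable updated in every iteration. */
int iv_original(int *a, const int *b, int k, int n) {
    for (int i = 0; i <= n; i++) {
        k = k + 3;
        a[i] = b[i] + a[k];
    }
    return k;                        /* final value of k */
}

/* After elimination: the value of k is computed from i, and the final
   value is assigned once after the loop (n+1 iterations, step 3). */
int iv_eliminated(int *a, const int *b, int k, int n) {
    for (int i = 0; i <= n; i++)
        a[i] = b[i] + a[k + (i + 1) * 3];
    return k + (n + 1) * 3;
}
```

Removing the update of k from the loop body also removes the dependence cycle that the update creates between iterations.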

Loop-Invariant Code Motion 

This transformation can also be conducted in the reverse fashion in order to get perfectly nested loops, which are easier to handle by other optimizations.

f«ti-0, i<N,- 14+, If(*>. 0) 

forfi=0; i<N; ±++, 
afil = b[i] + C ; 

Figure 24: Example of loop-invariant code motion

Loop Unswitching

This transformation moves a conditional instruction outside of a loop body if its condition is loop-invariant. The branches of the condition are then made of the original loop with the appropriate original statements of the conditional statement. It allows further parallelization of the loop by removing control flow in the loop body.



for(i=0; i<N; i++) {
    a[i] = b[i] + 3;
    if (x > 2)
        b[i] = c[i] + 2;
    else
        b[i] = c[i] - 2;
}

if (x > 2)
    for(i=0; i<N; i++) {
        a[i] = b[i] + 3;
        b[i] = c[i] + 2;
    }
else
    for(i=0; i<N; i++) {
        a[i] = b[i] + 3;
        b[i] = c[i] - 2;
    }

Figure 25: Example of loop unswitching

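If-Conversion

This transformation is applied on loop bodies with conditional instructions. It changes control dependences into data dependences and then allows vectorization to take place. It can be used in conjunction with loop unswitching to handle loop bodies with several basic blocks. The conditions, where array expressions could appear, are replaced by boolean terms called guards. Processors with predicated execution support can execute such code directly.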



for(i=0; i<N; i++) {
    a[i] = a[i] + b[i];
    if (a[i] != 0)
        if (a[i] > c[i])
            a[i] = a[i] - 2;
        else
            a[i] = a[i] + 1;
    d[i] = a[i] * 2;
}

for(i=0; i<N; i++) {
    a[i] = a[i] + b[i];
    c2 = (a[i] != 0);
    if (c2) c4 = (a[i] > c[i]);
    if (c2 && c4) a[i] = a[i] - 2;
    if (c2 && !c4) a[i] = a[i] + 1;
    d[i] = a[i] * 2;
}

Figure 26: Example of if-conversion
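
A minimal C sketch of if-conversion is given here. The function names are hypothetical, and the use of conditional expressions to model guarded assignments is an assumption made for illustration; hardware with predicated execution would evaluate the guards directly.

```c
/* Branching form: nested conditionals inside the loop body. */
void update_branching(int *a, const int *b, const int *c, int *d, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        if (a[i] != 0) {
            if (a[i] > c[i])
                a[i] = a[i] - 2;
            else
                a[i] = a[i] + 1;
        }
        d[i] = a[i] * 2;
    }
}

/* If-converted form: the conditions become boolean guards c2 and c4,
   and every assignment is executed under its guard. */
void update_guarded(int *a, const int *b, const int *c, int *d, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        int c2 = (a[i] != 0);
        int c4 = c2 && (a[i] > c[i]);   /* only meaningful when c2 holds */
        a[i] = (c2 && c4)  ? a[i] - 2 : a[i];
        a[i] = (c2 && !c4) ? a[i] + 1 : a[i];
        d[i] = a[i] * 2;
    }
}
```

Both guards are computed before any guarded assignment modifies a[i], so the two updates remain mutually exclusive.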



Strip-Mining 

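This transformation makes it possible to adjust the granularity of an operation. It is commonly used to choose the number of independent computations in the inner loop nest. When the iteration count is not known at compile time, it can be used to generate a fixed iteration count inner loop satisfying the resource constraints. It can be used in conjunction with other transformations like loop distribution or loop interchange. It is also called loop sectioning. Cycle shrinking, also called stripping, is a specialization of strip-mining.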



for(i=0; i<N; i++)                    up = (N/16)*16;
    a[i] = b[i] + c;                  for(i=0; i<up; i = i + 16)
                                          a[i:i+15] = b[i:i+15] + c;
                                      for(j=up; j<N; j++)
                                          a[j] = b[j] + c;



Figure 27: Example of strip-mining 
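
The strip-mined loop can be sketched in plain C as follows. The names and the strip length of 16 are assumptions for illustration; the inner loop over one strip stands in for the vector statement of the figure.

```c
#define STRIP 16   /* assumed strip length */

/* Reference loop. */
void add_const(int *a, const int *b, int c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c;
}

/* Strip-mined loop: full strips of STRIP elements, then a cleanup loop
   for the remaining n - up elements. */
void add_const_strip_mined(int *a, const int *b, int c, int n) {
    int up = (n / STRIP) * STRIP;
    for (int i = 0; i < up; i += STRIP)
        for (int j = i; j < i + STRIP; j++)   /* one strip of fixed length */
            a[j] = b[j] + c;
    for (int j = up; j < n; j++)              /* cleanup iterations */
        a[j] = b[j] + c;
}
```

The inner loop always runs exactly STRIP iterations, which is what makes it a fixed-size unit of work even when n is unknown at compile time.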
Loop Tiling

This transformation modifies the iteration space of a loop nest by introducing loop levels to divide the iteration space into tiles.
The size of the tiles of the iteration space is chosen so that the data needed in each tile fits in the cache memory.

for(i=0; i<N; i++)                    for(ii=0; ii<N; ii = ii+16)
    for(j=0; j<N; j++)                    for(jj=0; jj<N; jj = jj+16)
        a[i][j] = b[j][i];                    for(i=ii; i<min(ii+16,N); i++)
                                                  for(j=jj; j<min(jj+16,N); j++)
                                                      a[i][j] = b[j][i];

Figure 28: Example of loop tiling
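
The tiled transposition can be written as a runnable C sketch. The function names, the flat row-major array layout and the tile size of 16 are assumptions made for illustration.

```c
#define TILE 16   /* assumed tile edge length */

static int min_int(int x, int y) { return x < y ? x : y; }

/* Reference transposition of an n x n matrix stored row-major. */
void transpose_plain(int n, int *a, const int *b) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] = b[j * n + i];
}

/* Tiled transposition: the two outer loops enumerate tiles, the two
   inner loops scan one tile, clipped at the matrix border. */
void transpose_tiled(int n, int *a, const int *b) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < min_int(ii + TILE, n); i++)
                for (int j = jj; j < min_int(jj + TILE, n); j++)
                    a[i * n + j] = b[j * n + i];
}
```

The upper bounds use ii + TILE and jj + TILE with a strict comparison, so every element of a tile is visited exactly once.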

Loop Interchange 

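This transformation is applied to a loop nest to move inside or outside (depending on the searched effect) the loop level containing data dependences. It can: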

- enable vectorization by moving inside an independent loop and outside a dependent loop, or
- improve vectorization by moving inside the independent loop with the largest range, or
- reduce the stride, or
- increase the number of loop-invariant expressions in the inner-loop, or

am - Ml] ♦ HUM I, -W -?M?l WW! 

Figure 29: Example of loop interchange

Loop Coalescing / Collapsing 

of dimensions ceMa^^SSJ^S^^ T" of ? oa,esoin S in which the number 
dimensional aamS£S^^Z^ ollapsmg reduces the overhead of nested loops and muh> 

increasingthe iteration rangTS fnnSmo^ ' * °" t0 mU Vectorizin S * 

- ' } B til CD J • ati] [j] + c; 

Figure 30: Example of loop coalescing

Loop Fusion



for(i=0; i<N; i++)                    for(i=0; i<N; i++) {
    a[i] = b[i] + c;                      a[i] = b[i] + c;
for(i=0; i<N; i++)                        d[i] = e[i] + c;
    d[i] = e[i] + c;                  }

Figure 31 : Example of loop fusion 

Loop Distribution 



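for(i=0; i<N; i++) {                  for(i=0; i<N; i++)
    a[i] = b[i] + c;                      a[i] = b[i] + c;
    d[i] = e[i] + c;                  for(i=0; i<N; i++)
}                                         d[i] = e[i] + c;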



Figure 32: Example of loop distribution

Loop Unrolling / Unroll-and-Jam 

WSS^ * — to •* * A ,ocp can be 

loop body bigge?, it also mSesUri^ or £FZ hy making ihe 

} a t i+1 J = b£i+l] + c; 

if ( (N-l)32) = i) 

a[N-l] - btN-ij + c; 

Figure 33: Example of loop unrolling 
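
A hedged C sketch of unrolling by a factor of two follows. The function names are hypothetical; the remainder test is written in terms of the loop index, which is one of several equivalent ways to handle an odd iteration count.

```c
/* Reference loop. */
void add_const_rolled(int *a, const int *b, int c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c;
}

/* Loop unrolled by two: the body is duplicated and the step doubled.
   When n is odd, one element is left over and handled after the loop. */
void add_const_unrolled2(int *a, const int *b, int c, int n) {
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        a[i]     = b[i]     + c;
        a[i + 1] = b[i + 1] + c;
    }
    if (i < n)                      /* leftover iteration for odd n */
        a[i] = b[i] + c;
}
```

The larger loop body exposes two independent statements per iteration at the cost of the extra remainder check.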

Loop Alignment 

2~22SSffl^^ * *° ** » effect it to 

Parallelism fro m . , oop . ? tcan ose di E£SS - ^ ZT ?"* ^ t0 ™ re 
statements, to achieve its goal. This transformation can be used in conjunction with loop fusion to enable this optimization by aligning the array accesses in both loop nests. In the example below, all accesses to array a become aligned.

Figure 34: Example of loop alignment

Loop Skewing 

This transformation is used to enable parallelization of a loop nest. It is useful in combination with loop interchange. It is performed by adding the outer loop index, multiplied by a skew factor f, to the bounds of the inner loop variable, and then subtracting the same quantity from every use of the inner loop variable inside the loop.

for (1=1; i N; ... 

a[i) = a[i +j j + c ; "Tjr 1 '" 3 i+N; ^ ++ > 

Figure 35: Example of loop skewing

Loop Peeling 

for (1=0; ±<» N ; " 

cnj = atouN, + a[NJ m , ;is l ,is,r<sfi;"i + : J ai " 3 

a[i][NJ = a[0] [N] + aTN] [ni - 
Figure 36: Example of loop peeling

Loop Splitting 




loop peeling 



for(i=0; i<=N; i++)                   for(i=0; i<(N+1)/2; i++)
    a[i] = a[N-i+1] + c;                  a[i] = a[N-i+1] + c;
                                      for(i=(N+1)/2; i<=N; i++)
                                          a[i] = a[N-i+1] + c;



Figure 37: Example of loop splitting 

Node Splitting 
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rHr'^r^. Htm* 

b"[l] = a[i] + tlfij; 
^ a[i+l] = b[i] * tZ[i] ; 

Figure 38: Example of node splitting

Scalar Expansion 

If the scalar is used after the loop, compensation code must be added.

}• C ] aU] + c ' } «UJ - «t±l i anp[ij, 

c = tmp[N-l]; 

Figure 39: Example of scalar expansion 
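
Scalar expansion with its compensation code can be sketched in C as shown here. The function names and the loop body (the scalar c holding a copy of b[i]) are assumptions for illustration; only the compensation statement c = tmp[N-1] is taken from the figure.

```c
/* Scalar form: every iteration writes the same scalar c. */
int accumulate_scalar(int *a, const int *b, int c, int n) {
    for (int i = 0; i < n; i++) {
        c = b[i];
        a[i] = a[i] + c;
    }
    return c;                       /* value of c after the loop */
}

/* Expanded form: c is replaced by one array element per iteration,
   so the iterations no longer share a storage location. */
int accumulate_expanded(int *a, const int *b, int *tmp, int c, int n) {
    for (int i = 0; i < n; i++) {
        tmp[i] = b[i];
        a[i] = a[i] + tmp[i];
    }
    if (n > 0)
        c = tmp[n - 1];             /* compensation code: c is used after the loop */
    return c;
}
```

The guard n > 0 keeps the original value of c when the loop does not execute at all.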

Array Contraction / Array Shrinking

for(i=0; i<N;±++) . ,. „ 

for(j-0, j<N;j ++ ) ( f°r(i= 0 ; : *<K ; i ++) 

twitj], = «ui j 3 * 3 . ( 27?: 3<N;j ++) ( 

Figure 40: Example of array contraction

Scalar Replacement 

^^Str^lS"^ "V ™ s ™* — - * 

«*« in coaj™cfi„„ wid, ,„,„, SMred asam "** *= ™«» ">°P. if it is modified. It can be 

.cti-'-u?:'^,,,,, s,y, ti jj„ lw 

tmp = tmp + b[i][j];
a[i] = tmp;

Figure 41: Example of scalar replacement

Reduction Recognition 

This transformation allows to handle reductions in loops.

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!^^^"^T^5 3 T SSible - °" ^ h * * vector 

then achieved bv ^S^ZS^^^^J^^ ^ " 
these results are summed, etc. with a tree, pairs of elements are summed, then pairs of 

for(i=0; i<N; i++)                    for(i=0; i<N; i = i+64)
    s = s + a[i];                         tmp[0:63] = tmp[0:63] + a[i:i+63];
                                      for(i=0; i<64; i++)
                                          s = s + tmp[i];

Figure 42: Example of reduction recognition
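
The partial-sum scheme of the figure can be sketched in C. The names are hypothetical and the lane count of 64 follows the figure; the cleanup loop for element counts that are not a multiple of 64 is an addition for illustration and does not appear in the figure.

```c
#define LANES 64   /* number of partial sums, as in the figure */

/* Sequential reduction. */
long sum_sequential(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s = s + a[i];
    return s;
}

/* Reduction with LANES independent partial sums. The inner loop over
   the lanes stands in for the vector statement on tmp[0:63]. */
long sum_partial(const int *a, int n) {
    long tmp[LANES] = {0};
    int up = (n / LANES) * LANES;
    for (int i = 0; i < up; i += LANES)
        for (int l = 0; l < LANES; l++)
            tmp[l] = tmp[l] + a[i + l];
    long s = 0;
    for (int l = 0; l < LANES; l++)    /* final combination of the lanes */
        s = s + tmp[l];
    for (int i = up; i < n; i++)       /* leftover elements, if any */
        s = s + a[i];
    return s;
}
```

Integer addition is associative, so the reordering does not change the result; for floating-point data the rounding could differ.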

Loop Pushing / Loop Embedding 

for(i=0; i<N; i++) f2 . . r r &. 

void £ (int. a , int j, { TO± l f ?. (i S t */ ) { 

Figure 43: Example of loop pushing

Procedure Inlining 

This transformation replaces a call to a procedure by the code of the procedure itself. It is an inter-procedural optimization. It allows a loop nest to be parallelized, removes the overhead caused by the procedure call, and can improve locality.

for(i=0; i<N; i++)                    for(i=0; i<N; i++)
    f(a,i);                               a[i] = a[i] + c;

void f(int* x, int j) {
    x[j] = x[j] + c;
}

Figure 44: Example of procedure inlining

Statement Reordering 

. ( «i-.u-„-(, ( ;[« ; -£-11 

Figure 45: Example of statement reordering 
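Software Pipelining

This transformation parallelizes a loop body by scheduling instructions of different instances of the loop body. It is a powerful optimization to improve instruction-level parallelism. It can be used in conjunction with loop unrolling. In the example below, the preload commands can be issued one after another, each taking only one cycle. This time is just enough to request the memory areas. It is not enough to actually load them. This takes many cycles, depending on the cache level that actually has the data. Execution of a configuration behaves similarly. The configuration is issued in a single cycle, waiting until all data are present. Then the configuration executes for many cycles. Software pipelining overlaps the execution of a configuration with the preloads for the next configuration.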



Issue Cycle   Command
              XPPPreloadConfig(CFG1);
              for (i=0; i<100; ++i) {
1:                XPPPreload(2, a+10*i, 10);
2:                XPPPreload(5, b+20*i, 20);
3:
4:                // delay
5:
6:                XPPExecute(CFG1);
              }

Issue Cycle   Command
Prologue      XPPPreloadConfig(CFG1);
              XPPPreload(2, a, 10);
              XPPPreload(5, b, 20);
              // delay
              for (i=1; i<100; ++i) {
Kernel 1:         XPPExecute(CFG1);
       2:         XPPPreload(2, a+10*i, 10);
       3:         XPPPreload(5, b+20*i, 20);
              }
Epilog        XPPExecute(CFG1);
              // delay

Figure 46: Example of software pipelining

Vector Statement Generation 

Figure 47: Example of vector statement generation

3.2.3 Data-Layout Optimizations

Scalar Privatization

This optimization is used in multi-processor systems to increase the amount of parallelism and avoid unnecessary communications between the processing elements. If a scalar is only used like a temporary variable in a loop body, then each processing element can receive a copy of it and achieve its computations with this private copy.

for (1=0; i <= N;i++j { 

C e. b[±]; 

Figure 48: Example of scalar privatization

Array Privatization 
Array Merging 

of the arrays can be different fbr^i^^Kh 2?" ? ^you? 

accesses to array « are interleaved with acces^f ta a ™ a ^P 1 * ° f a c ™S5.filter, where the 
layout of both arrays where blocksT* fin^nTa^^ 6 W ^ to * re P rese °* *. date 
memory space is m white. Thus cache b£>» ^oLTS^t ^ ° f 6 (i " y e "^)- Unused 

for(j=l; j<=N;j++) 



Figure 49: Example of array merging

3.2.4 Example of application of the optimizations 

fT'^'Jaa^ «T d before - d *• — ~ n of 

JX,^ rfa P">**» ■ -a I « area of ^SS^SSS^ T ™ 5 ° Iution ft " a « 

opjm.zat.ons that follows a reasonable S to ™^, ^l"" ^Propose a way to use these 
To vectorize code, we can use the Allen-Kennedy algorithm, which uses statement reordering and loop distribution before vector statements are generated. It can be enhanced with index set splitting, node splitting and scalar expansion. These transformations are based on the data dependence graph. A statement can be vectorized if it is not part of a dependence cycle; hence optimizations are performed to break cycles or, if that is not completely possible, to create loop nests without dependence cycles.

dataflow optimizations are applied to fte loL mLZ ^JIT™ ^ ™ e " so ™ high-lev^ 
code. ^ « 5tep _ w consist fc preparL^r^ 
loop nests and ensuring that inner loop levels are vectorizable.

that target the architecture and op££ ^dSTSSJ ^' CM be P erfb ^ 
opt.m.zat.ons and code transformations can occj ° Uid ak ° be noted ftat 
further optima the loop nests, dlfFcrent steps that can also help to 

reduction and idiom ^^T^Si^^T^?^ loo P switching, strength 

can first a p pIy loop rever f aI I^S^^^J « of o,*^^^!^ 

aHows to build the data dependency gS Th e ™ £S " * 261 nonna,i ** I°°P nests. This 
^formations can he appUed. IWhStS tf diSE^^.^ ^ « * * 
peehng or loop splitting can be applied Node sdhS t™ £ " r ° n,y ° n Certain itera «^ loop 
reordering can be applied in other cases T Then in™ ? 3* i ° P skewm & wUar expansion or statement 
dep^ce cycles, fhe gcal is S^.S^SSS^T^ ?" '°° P Ieve ^°* 
cycles as much outwards as possible Then we ™Jv„S P ? Wlt, l the Ioo P leve «a carrying dependence 
rep.ace.nent/array contraction' and loop dM^S fUSi ° n ' l^ 00 

T^e L^l ment generati ° n ^ be P^nS af S S^^KS,* B h fo "° W i n3 ve ^ation. 

- te each loop nes , 

Heunsucs are used to guide the appHcation of t?3 ^ ^ S ° me " f *« applied 
needed. Let us illustrate this with 2S3k "i"™"*™* that can be applied severa , 



yo±<l g(int* a, in** c ,int i) 
j cfi] + 2; 



for(i=0; i<M;i++) | 
if (fc>0) 
else 

ati] tij = JbtiJ + 3 . 



In more detail, after procedure inlining, we obtain:

for(i=0; i<N; i++) {
    for(j=1; j<9; j++)
        if (k>0)
            a[i][j] = a[i][j-1] - b[i+1][j-1];
        else
            d[j] = c[j] + 2;
    d[i] = d[i+1] + 2;
}
for(i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop unswitching, we obtain: 

if (k > 0)
    for(i=0; i<N; i++) {
        for(j=1; j<9; j++)
            a[i][j] = a[i][j-1] - b[i+1][j-1];
        d[i] = d[i+1] + 2;
    }
else
    for(i=0; i<N; i++) {
        for(j=1; j<9; j++)
            d[j] = c[j] + 2;
        d[i] = d[i+1] + 2;
    }
for(i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop normalization, we obtain:

if (k > 0)
    for(i=0; i<N; i++) {
        for(j=0; j<8; j++)
            a[i][j+1] = a[i][j] - b[i+1][j];
        d[i] = d[i+1] + 2;
    }
else
    for(i=0; i<N; i++) {
        for(j=0; j<8; j++)
            d[j] = c[j+1] + 2;
        d[i] = d[i+1] + 2;
    }
for(i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop distribution and loop fusion, we obtain: 

if (k > 0)
    for(i=0; i<N; i++)
        for(j=0; j<8; j++)
            a[i][j+1] = a[i][j] - b[i+1][j];
else
    for(i=0; i<N; i++)
        for(j=0; j<8; j++)
            d[j] = c[j+1] + 2;
for(i=0; i<N; i++) {
    d[i] = d[i+1] + 2;
    a[i][i] = b[i] + 3;
}
After loop interchange, we obtain:

if (k > 0)
    for(j=0; j<8; j++)
        for(i=0; i<N; i++)
            a[i][j+1] = a[i][j] - b[i+1][j];
else
    for(i=0; i<N; i++)
        for(j=0; j<8; j++)
            d[j] = c[j+1] + 2;
for(i=0; i<N; i++) {
    d[i] = d[i+1] + 2;
    a[i][i] = b[i] + 3;
}

After vector code generation, we obtain 

if (k > 0)
    for(j=0; j<8; j++)
        a[0:N-1][j+1] = a[0:N-1][j] - b[0:N][j];
else
    for(i=0; i<N; i++)
        d[0:8] = c[1:9] + 2;
d[0:N-1] = d[1:N] + 2;
a[0:N-1][0:N-1] = b[0:N] + 3;
4 Compiler Specification for the PACT XPP



4.1 Introduction 

constraints. Tile compile*, prlma^jeclSefe to co^lZ * b" to consider those design 

to innertoost loops and to make op as much data locality as possibJe^prthem 1 ex P ena ' Ve calculations 

2iSr5l^^ ^cedura, ana.y Sis , like alias 

globaj ^formation to l^SS^SmS^SL^ " neceBs£Uy to «■» *• P^Sation of 
influence the compiler. pnm,ZaWOns - 1116 fo "°wing sections concentrate on the way the PACT XPP 

4.2 Compiler Structure 

other ^ ps are briefly defcSef ^ ^ 011 * e ^ «*-P*r itself, but first fte 
Figure 50: Global View of the Compiling Process (stages shown: Code Preparation, Partitioning, XPP Compiler, RISC Code Gen., RISC Code Sched.)

4.2.1 Code Preparation

many fcop „e S ts as^SiK °bt exeSS tfgSS^^^ c W B " r to - 

4.2.2 Partitioning 

Partitioning decides which part of the program is executed by the host processor and which part is executed on the PACT XPP.

A loop nest is executed by the host in three cases: 
■ if the loop nest is not well-formed, 
XPP when the loop bounds are suitable. This would also ease applications of loop transformations, as possible compensation code would be simpler due to the hypothesis on the loop bounds.

4.2.3 RISC Code Generation and Scheduling

After the XPP compiler has produced NML code for the loops chosen by the partitioning phase, the main compiling process must handle the code that will be executed by the host processor, where instructions to manage the configurations have been inserted. This is the aim of the last two steps:

■ RISC Code Generation and 

■ RISC Code Scheduling. 

The first one produces code for the host processor, and the second one optimizes it further by looking for a better scheduling, using software pipelining for instance.

4.3 XPP Compiler for Loops 

NML code generation and the mapping of the configuration on the PACT XPP. pamtlomnB P nase ' 



Figure 51: Detailed Architecture of the XPP Compiler (stages shown: XPP Loop Opt., NML Code Gen., Mapping, Temporal Partitioning, with fail edges back to the loop optimizations and an exit after too many fails)



First loop optimizations targeted at the PACT XPP am nrmKcA t„ *™ *~ a 

IhM can be „„ ^ «£^Ztt£%S££ZZ£2*£!!Z 
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PACT XPRfa this cJ^^^SS^^lSS!!^ ^-u will not fit on the 

of the NML code generation orTCmSnin. ^ P ^ respect to *• ^ easons of 

. change the code,Cp7xa] ^ 
attempts for the NML Code Sww^ m d m F"^™ 018 we kee P *«* of the number of 

do no P t obtain a ^^^^S^t^^^^ "TP m ^ md we stiU 
processor. procesa, and (he loop nest will be executed by the host 

4.3.1 Temporal Partitioning 

Temporal partitioning splits the code generated for the PACT XPP into several configurations if the number of operations, i.e. the size of the configuration, to be executed in a loop nest exceeds the number of operations executable in a single configuration. This transformation is called loop dissevering [6]. These configurations are then integrated in a loop of configurations whose number of executions corresponds to the iteration range of the original loop.

4.3.2 Generation of NML Code 

2£ XPP Loop Options 

DAG-pattem matching techniques: ^ ™ ^ 0008 0311 111611 be P™"«=ed by using tree- or 

4.3.3 Mapping Step 

JS^sj^^j*^ is™ - *— - — - - 

4.4 XPP Loop Optimizations Driver 

The loop optimizations used for the PACT XPP are now described. Their goal is to extract as much parallelism as possible from the loop nests in order to execute them on the PACT XPP by exploiting the ALU-PAEs as effectively as possible and to avoid memory bottlenecks with the IRAMs. The following sections explain how they are organized and how to take into account the architecture for applying the optimizations.

4.4.1 Organization of the System

^JS^oCSS oXSo^^^ translations are divided 

called sever* times. Loops ^^^ZtZ^^^Tl'^ could >° 

each driver loop can be of constant^^dSr^ * ^f 1 - 11,8 of iterations for 

e-g. repeat until a certain code qShyls reihS ST V V^f 1 ' 6 ^ ^ * e optimizations itself 
loop nests are usable for the PACT XPP u^^inf S* 'S^' 00 ° f loQ P' ic ca * b * if 
instance if the loop nest is weU-fbrmed^na iedZ^LT^ to chcck *• bounds etc For 
but the loop bounds are unknown ™S T^&SiSSS^* n01 PWvent <**°****i 

that is easier to handle and can be b2£ opSnS an? ^°Z^ " ^f 1 to » inn ^ 
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™i™L S JH r PACT , XPP - Nevertheless this has not been necessary until now with the 
examples presented in the next chapters. 

Group I ensures that no procedure calls occur in the loop nest. Group II prepares the loop bodies by removing loop-invariant instructions and conditional instructions to ease the analysis. Group III prepares the loop nests for the data dependence analysis. Group IV contains optimizations to transform the loop nests to get data dependence graphs that are suitable for vectorization. Group V contains optimizations that ensure that the innermost loops can be executed on the PACT XPP. Group VI contains optimizations that further extract parallelism from the loop bodies. Group VII contains optimizations more towards optimizing the usage of the hardware itself.

In each group the application of the optimizations depends on the result of the analysis and the characteristics of the loop nest. For instance it is clear that not all transformations in Group IV are applied. It depends on the data dependence graph computed before.



Group I:    procedure inlining, loop pushing
Group II:   loop-invariant code motion, loop unswitching, strength reduction, induction variable elimination
Group III:  loop reversal, loop normalization, if-conversion
Group IV:   loop peeling, loop splitting, node splitting, loop skewing, scalar expansion, statement reordering
Group V:    loop interchange, loop distribution, loop collapsing, loop tiling, strip-mining, loop alignment
Group VI:   loop fusion, reduction recognition, scalar replacement, loop unrolling/unroll-and-jam
Group VII:  data duplication, shift register synthesis, loop pipelining, tree balancing

Figure 52: Detailed View of the XPP Loop Optimizations



4.4.2 Loop Preparation 

The optimizations of Groups I, II and III of the XPP compiler generate loop bodies without procedure calls, conditional instructions and induction variables. Thus loop nests, where the innermost loops are suitable for execution on the PACT XPP, are obtained. The iteration ranges are normalized to ease data dependence analysis and the application of other code transformations.

4.4.3 Transformation of the Data Dependence Graph

dependence cycle, mat'would noSy p^ft ty v^Sot ** * 

optimization of a loop nest for the pact vpp 7f i • ^ of the code, does not prevent the 
tfat it won't P-int ^ * «*ld b, 

prevent valorization for the PACT wh e ^o • ♦ ^J- Furftermore dependence cycles will not 

^"vhol*^ 

(statement reordering), ° f date ^P^ence grap h 

by thePACTXPPantsomebyth^tt^ l0 °P ^ whi * - bandit 

4.4.4 Influence of the Architectural Parameters 

rfope^^^^ loop^fo^. ^ nufflber 

~« influence loop ^ l^^^'r^ttt^ 

^^^Z^^^^^^r^ rr ti ° a ° f * e ^-tions. For each of 
value the p-JSS^SfSS ^SS^^^TS r^ 00 ^ wbi * «" 
Vector length depicts the ranue of th^nitl ^ ^ ? e W 1 " 5 *" 1 " »f the optimizations. 
«h'h^lBli? M, loops, i.e. the number of elements of an array 
I/O IRAMs, ALU FREG B^a r ^presents the amount of data that must fit in the cache 

respective,/^ co„S *fpACT XP? ^"Z^ ° f ^ FREGs > "* 

operations that can be executed in pL^n^ ^ ™ ^ regents the number of 

^presents the tengwofSpSe P,pelmB S ** B - ^ S^pn height 

tbenumberofcyctdedipaXX^
- decrease a parameter's value (-),
- increase a parameter's value (+),
- not influence a parameter (id), or
- adapt a parameter's value to fit into the goal size (make fit).

SSSSXri; & las- - — - - 






Parameter                 Goal                       Starting Value
Vector length             IRAM size (256 words)      Loop count
Reused data set size      Approx. cache size         Algorithm analysis/loop sizes
I/O IRAMs                 PACT size (16)             Algorithm inputs + outputs
ALU                       PACT size (< 64)           ALU opcode estimate
BREG                      PACT size (< 80)           BREG opcode estimate
FREG                      PACT size (< 80)           FREG opcode estimate
Data flow graph width     High                       Algorithm data flow graph
Data flow graph height    Small                      Algorithm data flow graph
Configuration cycles      command line parameter     Algorithm analysis



XXr^^ Options Let* be fte totaI aumber of 

values |„ a cycle and out fe m'alZtl! ^f^Z 8 ^ PK ** majdmu m ™n.ber of input 
XPP. n is the number o^ALU * * ^ °* *" ^ 
ALUs, FREGs and BREGs S oaTbV^e^?^ «.*"" * COnS &™* io »> r is the number of 
amount to the number of ^JSS^^tZSS!^ f" 6 I*?" stage and,* end am 
of IRAMs yields ^y^^^^J^^ «>put port and I output port- the number 

regardless of address operations Tte number ^StZST the - n T ber ° f ° perands of * 8 instructions 
instructions regardles/of SJ^SSSl '^K'SS^i." ^ T"? ° f ° UtpUt of * e 

and output values, and the ^o^^^Z ^^^^^T^ ° f paraIlel 0 P Bra tion S , input 
the architectural parameters are now ^Sdfn deS ***** ° f ^"nation on 

Loop Interchange 

infixed byte IVO^teXTrSl 01 ?^?^^ ^?™£-*«».also? a 
loops to gel a more practical way to Y^. f^ Bab j''» data locality to interchange two 

of coarse a>so Ma/ncad w£Z£Z^£$^£< k °°*°" » is 
Parameter                 Effect
Vector length
Reused data set size      make fit
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     id
Data flow graph height    id
Configuration cycles





Loop Distribution 

Loop distribution is applied if a loop body is too big to fit on the PACT XPP.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 make fit
ALU                       make fit
BREG                      make fit
FREG                      make fit
Data flow graph width
Data flow graph height
Configuration cycles





Loop Collapsing 



Parameter



Vector length 



Reused data set size 



I/O IRA Ms 



ALU 



BREG 



FREO 



Data flow graph width 



Data flow graph height 



Configuration cycles 



Effect 



+ 
+ 



id 



id 



id 
+ 
+ 
+ 



Loop Tiling 

Loop tiling, as multi-dimensional strip-minine is influent h« a ii 

when the iteration space is by far too big™ fin the iS P*™*™- * « specially usefol 

when the iteration space is unbounded (s^ Se^on i^w* guarantee maximum execution time 
respect to the resources of the PACT XPP, namely fte lRAM ^ tfa . en ,? nak ? * e Ioo P body fit with 
for strip-mining and loop tiling can becomJuSd Hke Sf* ^ ^ W ^ Size of * e tile * 

the final tile size is then the rn^ZXeZT^Zl "oXsZ/" T ^ST** elem ^ 
accessed w larger than the capacity of the cache, taJSEJt M ° f ^ 



for(i=0; i<=1048576; i++)
    <loop body>

for(i=0; i<=1048576; i+=CACHE_SIZE)
    for(j=0; j<CACHE_SIZE; j+=IRAM_SIZE)
        for(k=0; k<IRAM_SIZE; k++)
            <tiled loop body>

Figure 53: Example of loop tiling for the PACT XPP



Parameter                 Effect
Vector length             make fit
Reused data set size      make fit
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height    +
Configuration cycles      +




Strip-Mining 

64 ALU-FAEs which SS Tb?iSffiS toJST^ "Tf \r ble,n M *= PACT *» "»» 



Parameter 


Effect 


Vector length 
Reused data set size 


make fit 


I/O IRAMs 


id 
id 


ALU 


id 


KREG 


id 


FREG 




Data flow graph width 


+ 


Data flow graph height 




Configuration cycles 





Loop Fusion 

Loop fusion is applied when a loop body does not
Parameter



Vector length 



Reused data set size 



VOIRAMs 




FREG 



Dataflo w graph width 
Data flow graph height 



Effect 



id 



id 



id 



Configuration cycles 



Scalar Replacement 



Parameter 



Vector length 



Reused data set size 



I/O IRAMs 



ALU 



BREG 



FREG 



Data flow graph width 



Dataflow graph height 



Configuration cycles 



Effect 



id 



id 

Id" 



id 
Id" 



id 



Loop Unrolling 

* they modify tto sfee of th. lo?p STlW n«Th of" 13 ** inputs and c^ob rf w ~^J? 



Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 +
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height    +
Configuration cycles      +



Unroll-and-Jam 



Parameter                 Effect
Vector length             id
Reused data set size      +
I/O IRAMs                 +
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height    +
Configuration cycles      +



4.4.5 Optimizations Towards Hardware Improvements 

Shift Register Synthesis 
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Sr^ reSiSteI ^ ^ OTdin S on * e ™nber of iterations it is alive, a value shares several rasters anfl 



Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      id



Input Data Duplication 

-<^^^^^ "-en* <* the same array are 

tRAMs. The advantage against SSrisLJ^i^^^ ^ VaI,?eS ^ CB P iBd in 
increased parallel^ 2He frSS^Sw,^ ITS lensth ' 3011 

bottleneck can affect to i-ribLSJ^-ffiT^"^ l«* *e cache-IRAM 

Nevertheless we assuine ^at^h^IRVM 22£ depend "l? ?" * e °f data to be moved, 

memory hierarchy cache- IRAM transfers are negligible to transfers in the rest of the 
Parameter                 Effect
Vector length
Reused data set size
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      id



Loop Pipelining 



Parameter 

Vector length 


Eflect "I 

+ 


Reused data set size 


id 


l/OIRAMs 


id 


ALU " 


id 


BREG 


id 


FREG 

Data flow graph width 


id 


Data flow graph height 


+ 


Configuration cycles 





Tree Balancing 
Parameter



Vector length 
Reused data set size 



I/O IRAMs 



ALU 



BREG 
FREG 



Data flow graph width 



Data flow graph height 



Effect 



+ 

Id 

Id 



id 



id 



id 



Configuration cycles 



4.4.6 Limiting the Execution Time of a Configuration

XPP is Iumted, and therefore its execution tniwi mnennos t loop that is executed on the PACT 



while (ok) {
    <loop body>
}

while (ok)
    for(i=0; i<100 && ok; i++) {
        <loop body>
    }
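
The bounded form can be sketched as runnable C. The names run_bounded, step_fn and countdown are hypothetical, as is the idea of counting the bounded chunks; the limit of 100 iterations per chunk follows the example above.

```c
#define MAX_ITER 100   /* iteration limit per chunk, as in the example */

/* A step performs one loop body execution and returns nonzero while
   more work remains (the role of the flag ok). */
typedef int (*step_fn)(void *state);

/* Runs the loop body until the step reports completion. The inner loop
   never runs more than MAX_ITER iterations, so each chunk has a bounded
   execution time. Returns the number of chunks that were started. */
int run_bounded(step_fn step, void *state) {
    int ok = 1;
    int chunks = 0;
    while (ok) {
        for (int i = 0; i < MAX_ITER && ok; i++)
            ok = step(state);
        chunks++;   /* between chunks, control is back in the outer loop */
    }
    return chunks;
}

/* Example step: counts an integer down to zero. */
int countdown(void *state) {
    int *n = (int *)state;
    (*n)--;
    return *n > 0;
}
```

With a counter of 250, the body runs 250 times in three chunks of 100, 100 and 50 iterations.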



5 Case Studies



5.1 3x3 Edge Detector 

5.1.1 Original Code 

Source Code: 

#define VERLEN 16
#define HORLEN 16
main() {
    int v, h, inp;
    int p1[VERLEN][HORLEN];
    int p2[VERLEN][HORLEN];
    int htmp, vtmp, sum;

    for(v=0; v<=VERLEN-3; v++) {                 // loop nest 2
        for(h=0; h<=HORLEN-3; h++) {
            htmp = (p1[v+2][h] - p1[v][h]) +
                   (p1[v+2][h+2] - p1[v][h+2]) +
                   2 * (p1[v+2][h+1] - p1[v][h+1]);
            if (htmp < 0)
                htmp = - htmp;
            vtmp = (p1[v][h+2] - p1[v][h]) +
                   (p1[v+2][h+2] - p1[v+2][h]) +
                   2 * (p1[v+1][h+2] - p1[v+1][h]);
            if (vtmp < 0)
                vtmp = - vtmp;
            sum = htmp + vtmp;
            if (sum > 255)
                sum = 255;
            p2[v+1][h+1] = sum;
        }
    }

    for(v=0; v<VERLEN; v++)                      // loop nest 3
        for(h=0; h<HORLEN; h++)
            printf("%d\n", p2[v][h]);            // print output pixels from p2
}



5.1.2 Preliminary Transformations

Interprocedural Optimizations

The first step normally invokes interprocedural transformations like function inlining and loop pushing. Since no procedure calls are within the loop body, these transformations are not applied to this example.

Partitioning



loop nest depth. Thus basic blocks which m S £7? ' C blocte m ">»otMcd with the 

c^rss^^^^^ 

Loop Analysis and Normalization 

ZSZSS.*' *"* — * is more ltely ^ ^ ^ 

for(v=J; V < VERLEN - 1; v++) { 
for(h=l; h < HORLEN - 1; h+ -M { 

htmp = - htmp; 
vtmp = - vtmp; 



sum = htmp + vtmp; • 
if (sum > 255) 

sum >= 255; 
P2[v+1] [h+i] = S um; 



} 



is well 



If the original loop induction variable is called i with the increment value s and lower and upper loop bounds l and u, respectively, then the normalized loop with the induction variable i' and the upper bound u' (the lower bound l' is 0 by definition) is transformed as follows:

• The upper bound calculates to u' = (u - l)/s.

• All occurrences of i are replaced by l + i' * s.

Applied to the code above, the loop statement for(v=1; v < VERLEN - 1; v++) with the lower bound vl = 1, the upper bound vu = 14 (< 15 means <= 14 in integer arithmetic) and the increment vs = 1 transforms to

for(vn=0; vn <= (vu - vl)/vs; vn++)

or simplified

for(vn=0; vn <= 13; vn++)

The 'h-loop' is transformed equally, issuing the original code.
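
The two normalization rules can be expressed as small C helpers. The function names are hypothetical; the sketch assumes an inclusive upper bound u, a positive increment s and u >= l.

```c
/* Upper bound of the normalized loop: u' = (u - l) / s, using integer
   division. Assumes s > 0 and u >= l, with u inclusive. */
int normalized_upper(int l, int u, int s) {
    return (u - l) / s;
}

/* Value of the original induction variable i for the normalized
   index ip: i = l + ip * s. */
int original_index(int l, int s, int ip) {
    return l + ip * s;
}
```

For the v-loop of the example, normalized_upper(1, 14, 1) yields 13, matching the simplified loop header.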

Idiom Recognition 

for(v=0; v<=16-3; v++) {
    for(h=0; h<=16-3; h++) {
        htmp = (p1[v+2][h] - p1[v][h]) +
               (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
        htmp = abs(htmp);
        vtmp = (p1[v][h+2] - p1[v][h]) +
               (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
        vtmp = abs(vtmp);
        sum = min(htmp + vtmp, 255);
        p2[v+1][h+1] = sum;
    }
}



Dependency Analysis 

for(v=0; v<=16-3; v++) {
    for(h=0; h<=16-3; h++) {
S1      htmp = (p1[v+2][h] - p1[v][h]) +
               (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
S2      htmp = abs(htmp);
S3      vtmp = (p1[v][h+2] - p1[v][h]) +
               (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
S4      vtmp = abs(vtmp);
S5      sum = min(htmp + vtmp, 255);
S6      p2[v+1][h+1] = sum;
    }
}



«Pen^ Pipeune v^,^ The loop 

order of reads and writes. kcQ^S^^S^^^^! formation does not dSffifS 

remove the scalars completely. ^ ex P ression subshtution / dead code elimination will 



5.1.3 Pre Code Generation Transformations 

Forward Expression Substitution / Dead Code Elimination 

p2[v+1][h+1] = min(abs((p1[v+2][h] - p1[v][h]) +
                       (p1[v+2][h+2] - p1[v][h+2]) +
                       2 * (p1[v+2][h+1] - p1[v][h+1]))
                 + abs((p1[v][h+2] - p1[v][h]) +
                       (p1[v+2][h+2] - p1[v+2][h]) +
                       2 * (p1[v+1][h+2] - p1[v+1][h])), 255);
The scalar accesses then disappear completely. 
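As a plausibility check, the following sketch compares the loop body with scalar temporaries against the single substituted expression. The 16x16 image size follows the loop bounds used above; the function names and the synthetic test image are assumptions made only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define VERLEN 16
#define HORLEN 16

static int min_int(int a, int b) { return a < b ? a : b; }

/* Loop body with the scalar temporaries htmp, vtmp and sum. */
static void edge3x3_scalars(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN]) {
    for (int v = 0; v <= VERLEN - 3; v++) {
        for (int h = 0; h <= HORLEN - 3; h++) {
            int htmp = (p1[v+2][h] - p1[v][h]) +
                       (p1[v+2][h+2] - p1[v][h+2]) +
                       2 * (p1[v+2][h+1] - p1[v][h+1]);
            htmp = abs(htmp);
            int vtmp = (p1[v][h+2] - p1[v][h]) +
                       (p1[v+2][h+2] - p1[v+2][h]) +
                       2 * (p1[v+1][h+2] - p1[v+1][h]);
            vtmp = abs(vtmp);
            int sum = min_int(htmp + vtmp, 255);
            p2[v+1][h+1] = sum;
        }
    }
}

/* Same loop after forward expression substitution: no scalars remain. */
static void edge3x3_substituted(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN]) {
    for (int v = 0; v <= VERLEN - 3; v++) {
        for (int h = 0; h <= HORLEN - 3; h++) {
            p2[v+1][h+1] = min_int(abs((p1[v+2][h] - p1[v][h]) +
                                       (p1[v+2][h+2] - p1[v][h+2]) +
                                       2 * (p1[v+2][h+1] - p1[v][h+1])) +
                                   abs((p1[v][h+2] - p1[v][h]) +
                                       (p1[v+2][h+2] - p1[v+2][h]) +
                                       2 * (p1[v+1][h+2] - p1[v+1][h])), 255);
        }
    }
}

/* Returns 1 if both versions write identical output for a synthetic image. */
static int edge3x3_versions_agree(void) {
    static int p1[VERLEN][HORLEN], a[VERLEN][HORLEN], b[VERLEN][HORLEN];
    for (int v = 0; v < VERLEN; v++)
        for (int h = 0; h < HORLEN; h++) {
            p1[v][h] = (v * 37 + h * 11 + v * h) % 256;
            a[v][h] = b[v][h] = 0;
        }
    edge3x3_scalars(p1, a);
    edge3x3_substituted(p1, b);
    for (int v = 0; v < VERLEN; v++)
        for (int h = 0; h < HORLEN; h++)
            if (a[v][h] != b[v][h]) return 0;
    return 1;
}
```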
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Mapping to IRAMs 



am2[0]). The 



iram3[1] = min(abs((iram2[0] - iram0[0]) +
                   (iram2[2] - iram0[2]) +
                   2 * (iram2[1] - iram0[1])) +
               abs((iram0[2] - iram0[0]) +
                   (iram2[2] - iram2[0]) +
                   2 * (iram1[2] - iram1[0])), 255);

Tree Balancing 

Tree balancing is a valuable optimization before pipeline synthesis: the operands of the commutative add expressions can be interchanged to reduce the overall tree depth.




Figure 56 One of the subtrees before and after balancing
The resulting expression tree is shown in Figure 56.

5.1.4 XPP Code Generation

Pipeline Synthesis 

As already stated, the pipeline is synthesized by a dynamic programming tree matcher. In contrast to sequential processors, it does not generate instructions and register references, but PAE opcodes and port connections. The main calculation network is shown in Figure 57. The input data preparation network is not shown in this figure. The case of synthesized shift registers is shown in Figure 58, while the variant with duplicated input data simply consists of an IRAM for each input channel in Figure 57. Since the IRAMs are single-ported, it is obvious that it is not possible to read different addresses concurrently.

XPPPreload(config)
for(v=0; v<=16-3; v++) {
  XPPPreload(0, &p1[v], 16)
  XPPPreload(1, &p1[v+1], 16)
  XPPPreload(2, &p1[v+2], 16)
  XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3))
}

for shift register synthesis and like

XPPPreload(config)
for(v=0; v<=16-3; v++) {
  XPPPreload(0, &p1[v], 16)
  XPPPreload(1, &p1[v], 16)
  XPPPreload(2, &p1[v], 16)
  XPPPreload(3, &p1[v+1], 16)
  XPPPreload(4, &p1[v+1], 16)
  XPPPreload(5, &p1[v+2], 16)
  XPPPreload(6, &p1[v+2], 16)
  XPPPreload(7, &p1[v+2], 16)
  XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3),
             IRAM(4), IRAM(5), IRAM(6), IRAM(7), IRAM(8))
}

for data duplication, respectively.
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Figure 57 The main calculation network of the edge3x3 configuration The MULT <Mpt^~.l- „ , , 
abs 0 calculation ythile &S(H^!S!S^JS^^^^ *" ** 
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X n 



Figure 58 One input after shift register synthesis

code. This comparison does not account for^he mS« p * b ° Ut 4000 Cycles for "»» 
of input value, is very small in this e^^^^^S^T^ * * ° bviOUS ' 4116 nusaber 
number. The XPP performance on the other -Si? ■ CaIcuIa ? on tl ™ * Proportional to that 
T.ereforetheXPPperform^i^ 



Parameter                         Value (shift register synthesis)        Value (data duplication)
Vector length                     16                                      16
Reused data set size              256                                     256
I/O IRAMs                         3I + 1O = 4                             8I + 1O = 9
ALU                               27                                      21
BREG                              21 (1 defined + 20 route)               10 (1 defined + 9 route)
FREG                              22 (9 defined + 23 route)               19 (3 defined + 16 route)
Data flow graph width             14                                      14
Data flow graph height            3 (shift registers) + 8 (calculation)   8 (calculation)
Configuration cycles (simulated)  configuration            2262           configuration          2145
                                  preloads (1)  14*3*4      168           preloads    8*8*4       256
                                  cycles        14*57       798           cycles      14*52       728
                                  sum                      3228           sum                    3129

(1) assuming 4 words/cycle burst transfer
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5.1.5 Enhancing Parallelism 

Assuming an XPP64 core, this leaves plenty of room for further optimizations. Nevertheless, since all optimizations enhancing parallelism are performed before the synthesis takes place, it is crucial that they estimate the needed resources and the benefit of the transformation very carefully. Furthermore, they have to account for both input preparation strategies.

Loop Unrolling 

syndesis would exhaust mJ of the taStatf ffiSSTS ""J" Wlfc»Me.«u! shift register 

Unroll-and-Jam

toe n xpp 64 «^us/»^ 64 assuming XPP64) and calculates 

mo***. ^^- = 2 (integer division). 

"inner loop ~» 

Thus the source code would be transformed to:

for(v=0; v<=VERLEN-3; v+=2) {
  for(h=0; h<=HORLEN-3; h++) {
    p2[v+1][h+1] = min(abs((p1[v+2][h] - p1[v][h]) +
                           (p1[v+2][h+2] - p1[v][h+2]) +
                           2 * (p1[v+2][h+1] - p1[v][h+1])) +
                       abs((p1[v][h+2] - p1[v][h]) +
                           (p1[v+2][h+2] - p1[v+2][h]) +
                           2 * (p1[v+1][h+2] - p1[v+1][h])), 255);
    p2[v+2][h+1] = min(abs((p1[v+3][h] - p1[v+1][h]) +
                           (p1[v+3][h+2] - p1[v+1][h+2]) +
                           2 * (p1[v+3][h+1] - p1[v+1][h+1])) +
                       abs((p1[v+1][h+2] - p1[v+1][h]) +
                           (p1[v+3][h+2] - p1[v+3][h]) +
                           2 * (p1[v+2][h+2] - p1[v+2][h])), 255);
  }
}

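The transformation introduces additional accesses to p1[v+3][h], p1[v+3][h+2], p1[v+3][h+1] and p1[v+1][h+1], as well as a write access to p2[v+2][h+1]. That means 2 IRAMs more for shift register synthesis (1 input, 1 output) and 5 IRAMs more for data duplication (4 input, 1 output), while performance is doubled.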
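The effect of unroll-and-jam with factor 2 on the v-loop can be illustrated by the following sketch. It assumes a 16x16 picture, so the v-loop has an even number of iterations and no peeling is needed; edge_pixel and the other function names are hypothetical helpers introduced only for this example.

```c
#include <assert.h>
#include <stdlib.h>

#define VERLEN 16
#define HORLEN 16

/* One output value of the edge3x3 kernel, written to p2[v+1][h+1]. */
static int edge_pixel(int p1[VERLEN][HORLEN], int v, int h) {
    int ht = (p1[v+2][h] - p1[v][h]) + (p1[v+2][h+2] - p1[v][h+2]) +
             2 * (p1[v+2][h+1] - p1[v][h+1]);
    int vt = (p1[v][h+2] - p1[v][h]) + (p1[v+2][h+2] - p1[v+2][h]) +
             2 * (p1[v+1][h+2] - p1[v+1][h]);
    int sum = abs(ht) + abs(vt);
    return sum < 255 ? sum : 255;
}

/* Original loop nest. */
static void edge3x3_plain(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN]) {
    for (int v = 0; v <= VERLEN - 3; v++)
        for (int h = 0; h <= HORLEN - 3; h++)
            p2[v+1][h+1] = edge_pixel(p1, v, h);
}

/* v-loop unrolled by 2, the two copies of the h-loop jammed into one. */
static void edge3x3_unroll_jam2(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN]) {
    for (int v = 0; v <= VERLEN - 3; v += 2)
        for (int h = 0; h <= HORLEN - 3; h++) {
            p2[v+1][h+1] = edge_pixel(p1, v, h);
            p2[v+2][h+1] = edge_pixel(p1, v + 1, h);
        }
}

/* Returns 1 if both loop nests produce the same output picture. */
static int unroll_jam_agrees(void) {
    static int p1[VERLEN][HORLEN], a[VERLEN][HORLEN], b[VERLEN][HORLEN];
    for (int v = 0; v < VERLEN; v++)
        for (int h = 0; h < HORLEN; h++) {
            p1[v][h] = (v * 53 + h * 17 + 3 * v * h) % 256;
            a[v][h] = b[v][h] = 0;
        }
    edge3x3_plain(p1, a);
    edge3x3_unroll_jam2(p1, b);
    for (int v = 0; v < VERLEN; v++)
        for (int h = 0; h < HORLEN; h++)
            if (a[v][h] != b[v][h]) return 0;
    return 1;
}
```

The jammed body reads row v+3 in addition to rows v to v+2, which is the source of the additional IRAM demand.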
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Parameter 



                                  Value (shift register synthesis)        Value (data duplication - with IRAM placement)
Vector length                     16                                      16
Reused data set size              256                                     256
I/O IRAMs                         4I + 2O = 6                             12I + 2O = 14
ALU                               45                                      37
BREG                              31 (12 defined + 19 route)              42 (4 defined + 38 route)
FREG                              29 (1 defined + 28 route)               18 (1 defined + 17 route)
Data flow graph width             14                                      14
Data flow graph height            3 (shift registers) + 8 (calculation)   8 (calculation)
Configuration cycles (simulated)  configuration            2753           configuration          2754
                                  preloads      7*4*4       112           preloads    7*12*4      336
                                  cycles        7*53        371           cycles      7*69        483
                                  sum                      3236           sum                    3573



Parameter 



Vector len gth 
Reused data set size 



Value (data duplication - with 
IRam placement) 




C • 128 A 

"unroll-and.jiic, = ~^jT ~ 4 . 
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like (guards emphasized) S S t0 toe irmer Ioo P b °dy. Then the code looks 



for(v=0; v<=VERLEN-5; v+=4) {
  for(h=0; h<=HORLEN-3; h++) {
    p2[v+1][h+1] = min(abs((p1[v+2][h] - p1[v][h]) +
                           (p1[v+2][h+2] - p1[v][h+2]) +
                           2 * (p1[v+2][h+1] - p1[v][h+1])) +
                       abs((p1[v][h+2] - p1[v][h]) +
                           (p1[v+2][h+2] - p1[v+2][h]) +
                           2 * (p1[v+1][h+2] - p1[v+1][h])), 255);
    p2[v+2][h+1] = min(abs((p1[v+3][h] - p1[v+1][h]) +
                           (p1[v+3][h+2] - p1[v+1][h+2]) +
                           2 * (p1[v+3][h+1] - p1[v+1][h+1])) +
                       abs((p1[v+1][h+2] - p1[v+1][h]) +
                           (p1[v+3][h+2] - p1[v+3][h]) +
                           2 * (p1[v+2][h+2] - p1[v+2][h])), 255);
    if(v>0)
    p2[v+3][h+1] = min(abs((p1[v+4][h] - p1[v+2][h]) +
                           (p1[v+4][h+2] - p1[v+2][h+2]) +
                           2 * (p1[v+4][h+1] - p1[v+2][h+1])) +
                       abs((p1[v+2][h+2] - p1[v+2][h]) +
                           (p1[v+4][h+2] - p1[v+4][h]) +
                           2 * (p1[v+3][h+2] - p1[v+3][h])), 255);
    p2[v+4][h+1] = min(abs((p1[v+5][h] - p1[v+3][h]) +
                           (p1[v+5][h+2] - p1[v+3][h+2]) +
                           2 * (p1[v+5][h+1] - p1[v+3][h+1])) +
                       abs((p1[v+3][h+2] - p1[v+3][h]) +
                           (p1[v+5][h+2] - p1[v+5][h]) +
                           2 * (p1[v+4][h+2] - p1[v+4][h])), 255);
  }
}



5.1.6 Parameterized Function 

Source code 

The benchmark source code is not very likely to be written in that form in real world applications. Normally it would be encapsulated in a function with parameters for input and output arrays along with the sizes of the picture.

Therefore the source code would look similar to: 

void edge3x3(int *p1, int *p2, int HORLEN, int VERLEN)
{
  int v, h, htmp, vtmp, sum;

  for(v=0; v<=VERLEN-3; v++) {
    for(h=0; h<=HORLEN-3; h++) {
      htmp = (*(p1 + (v+2) * HORLEN + h)   - *(p1 + v * HORLEN + h)) +
             (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + v * HORLEN + h+2)) +
             2 * (*(p1 + (v+2) * HORLEN + h+1) - *(p1 + v * HORLEN + h+1));
      if (htmp < 0)
        htmp = - htmp;
      vtmp = (*(p1 + v * HORLEN + h+2)     - *(p1 + v * HORLEN + h)) +
             (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + (v+2) * HORLEN + h)) +
             2 * (*(p1 + (v+1) * HORLEN + h+2) - *(p1 + (v+1) * HORLEN + h));
      if (vtmp < 0)
        vtmp = - vtmp;
      sum = htmp + vtmp;
      if (sum > 255)
        sum = 255;
      *(p2 + (v+1) * HORLEN + h+1) = sum;
    }
  }
}



This requires some additional features from the compiler. 
• interprocedural optimizations and analysis

• hints by the programmer (e.g. a compiler-known assert(VERLEN % 2 == 0) makes unroll-and-jam actually possible without peeling off iterations and running them conditionally)



Fitting the Algorithm Optimally to the Array 



Strip Mining Inner Loop 

The inner loop is strip-mined with a strip size which is chosen to be of the same size as the IRAMs. Of course the strip loop's upper bound must be adjusted for the possibly incomplete last strip. After the strip mining the original code would look like (outer v-loop neglected):

for(h=0; h<=HORLEN-3; h+=stripsize)
  for(hh=h; hh<=min(h+stripsize-1, HORLEN-3); hh++) {
    htmp = (*(p1 + (v+2) * HORLEN + hh) - *(p1 + v * HORLEN + hh)) +
    ...
  }
}
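Strip mining itself is independent of the loop body. The following sketch shows the general pattern on a simple summation loop; the strip size, the array contents and all function names are assumptions chosen only for illustration.

```c
#include <assert.h>

#define MAXLEN 1024

static int min_int(int a, int b) { return a < b ? a : b; }

/* Original loop: h = 0 .. upper (inclusive). */
static long sum_plain(const int *a, int upper) {
    long sum = 0;
    for (int h = 0; h <= upper; h++) sum += a[h];
    return sum;
}

/* Strip-mined loop: the outer loop steps over strips, the inner loop runs
   over one strip; its bound is clamped for the incomplete last strip. */
static long sum_strip_mined(const int *a, int upper, int stripsize) {
    long sum = 0;
    for (int h = 0; h <= upper; h += stripsize)
        for (int hh = h; hh <= min_int(h + stripsize - 1, upper); hh++)
            sum += a[hh];
    return sum;
}

/* Returns 1 if both loops agree for n elements (1 <= n <= MAXLEN). */
static int strip_mining_agrees(int n, int stripsize) {
    static int a[MAXLEN];
    for (int i = 0; i < n; i++) a[i] = (i * i) % 97 - 40;
    return sum_plain(a, n - 1) == sum_strip_mined(a, n - 1, stripsize);
}
```

The clamp via min_int corresponds to the adjusted upper bound for the last strip; without it the inner loop would run past the original bound whenever the iteration count is not a multiple of the strip size.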



Parameter                         Value (shift register synthesis)        Value (data duplication - with IRAM placement)
Vector length                     16                                      16
Reused data set size              256                                     256
I/O IRAMs                         4I + 2O = 6                             12I + 2O = 14
ALU                               45                                      37
BREG                              31 (12 defined + 19 route)              42 (4 defined + 38 route)
FREG                              29 (1 defined + 28 route)               18 (1 defined + 17 route)
Data flow graph width             14                                      14
Data flow graph height            3 (shift registers) + 8 (calculation)   8 (calculation)
Configuration cycles (simulated)  configuration            2753           configuration          2754
                                  preloads      7*4*64     1792           preloads    7*12*64    5376
                                  cycles        128*530   67840           cycles      128*553   70784
                                  sum                     72385           sum                   78914
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5.2 FIR Filter 

5.2.1 Original Code 

Source code: 

#define N 256
#define M 8

for (i = 0; i < N-M+1; i++) {
S:  y[i] = 0;
    for (j = 0; j < M; j++)
S':     y[i] += c[j] * x[i+M-j-1];
}

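The constants N and M are replaced by their values by the pre-processor. The data dependence graph is the following:

(Figure: data dependence graph, with an edge labeled (=,<))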



for (i = 0; i < 269; i++) {
S:  y[i] = 0;
    for (j = 0; j < 8; j++)
S':     y[i] += c[j] * x[i+7-j];
}

We have the following table: 
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Parameter 



Vector length 



Reused data set size 



I/O IRAMs 



ALU 



BREG 



FREG 



Data flow graph width 



Data flow graph height 



Configuration cycles 



Value 



269 



2+8=10 



a. Yb. 



5.2.2 First Solution 

A first solution is to unroll the inner loop and to use shift register synthesis. No other optimization is applied before, as either they do not have an effect on the loop or they increase the need for IRAMs. After loop unrolling, we obtain the following code:



for (i = 0; i < 269; i++) {
    y[i] = 0;
    y[i] += c[0] * x[i+7];
    y[i] += c[1] * x[i+6];
    y[i] += c[2] * x[i+5];
    y[i] += c[3] * x[i+4];
    y[i] += c[4] * x[i+3];
    y[i] += c[5] * x[i+2];
    y[i] += c[6] * x[i+1];
    y[i] += c[7] * x[i];
}
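A small sketch can confirm that the unrolled body computes the same filter output as the original loop nest. It uses the bound N-M+1 derived from the #define values of the original code; the coefficient and input values as well as the function names are arbitrary assumptions for the test.

```c
#include <assert.h>

#define N 256
#define M 8

/* Original FIR loop nest. */
static void fir_rolled(const int *c, const int *x, int *y) {
    for (int i = 0; i < N - M + 1; i++) {
        y[i] = 0;
        for (int j = 0; j < M; j++)
            y[i] += c[j] * x[i + M - j - 1];
    }
}

/* Inner loop fully unrolled (M = 8). */
static void fir_unrolled(const int *c, const int *x, int *y) {
    for (int i = 0; i < N - M + 1; i++) {
        y[i] = 0;
        y[i] += c[0] * x[i+7];
        y[i] += c[1] * x[i+6];
        y[i] += c[2] * x[i+5];
        y[i] += c[3] * x[i+4];
        y[i] += c[4] * x[i+3];
        y[i] += c[5] * x[i+2];
        y[i] += c[6] * x[i+1];
        y[i] += c[7] * x[i];
    }
}

/* Returns 1 if both versions produce the same output vector. */
static int fir_versions_agree(void) {
    static int c[M], x[N], a[N - M + 1], b[N - M + 1];
    for (int j = 0; j < M; j++) c[j] = j - 3;
    for (int i = 0; i < N; i++) x[i] = (i * 29 + 7) % 101 - 50;
    fir_rolled(c, x, a);
    fir_unrolled(c, x, b);
    for (int i = 0; i < N - M + 1; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```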



Then the table looks like this:



r1 = x[1];
r2 = x[2];
r3 = x[3];
r4 = x[4];
r5 = x[5];
r6 = x[6];
r7 = x[7];

for (i = 0; i < 269; i++) {

    r2 = r3;
    r3 = r4;
    r4 = r5;
    r5 = r6;
    r6 = r7;
    r7 = x[i+7];
}




I^cS^ with reject to a standard superscalar 
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Parameter 


                          Value
Vector length             269
Reused data set size      -
I/O IRAMs                 2
ALU
BREG                      0
FREG                      7
Data flow graph width     3
Data flow graph height    9
Configuration cycles      8+269=277




Ops                                  Number
LD/ST (2 cycles)                     2
ADDRCOMP (1 cycle)                   0
ADD/SUB (1 cycle)                    8
MUL (2 cycles)                       8
SHIFT (1 cycle)                      0
Cycles per iteration                 28
Cycles needed for the loop (2-way)   (28*269)/2=3766



Variant with Larger Loop Bounds 

Let us take larger loop bounds and set the values of N and M to 1024 and 64.

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        y[i] += c[j] * x[i+63-j];
}

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            y[i] += c[8*jj+j] * x[i+63-8*jj-j];
}

A subsequent application of loop unrolling on the inner loop yields:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++) {
        y[i] += c[8*jj]   * x[i+63-8*jj];
        y[i] += c[8*jj+1] * x[i+62-8*jj];
        y[i] += c[8*jj+2] * x[i+61-8*jj];
        y[i] += c[8*jj+3] * x[i+60-8*jj];
        y[i] += c[8*jj+4] * x[i+59-8*jj];
        y[i] += c[8*jj+5] * x[i+58-8*jj];
        y[i] += c[8*jj+6] * x[i+57-8*jj];
        y[i] += c[8*jj+7] * x[i+56-8*jj];
    }
}



Finally we obtain the same data flow graph as above, except that the coefficients must be read from another IRAM. After shift register synthesis the code is the following:

for (i = 0; i < 961; i++) {
    r0 = x[i+56];
    r1 = x[i+57];
    r2 = x[i+58];
    r3 = x[i+59];
    r4 = x[i+60];
    r5 = x[i+61];
    r6 = x[i+62];
    r7 = x[i+63];
    for (jj = 0; jj < 8; jj++) {
        y[i] = c[8*jj]*r0 + c[8*jj+1]*r1 + c[8*jj+2]*r2 + c[8*jj+3]*r3 +
               c[8*jj+4]*r4 + c[8*jj+5]*r5 + c[8*jj+6]*r6 + c[8*jj+7]*r7;
        r0 = r1;
        r1 = r2;
        r2 = r3;
        r3 = r4;
        r4 = r5;
        r5 = r6;
        r6 = r7;
        r7 = x[i+63-8*jj];
    }
}


Parameter                 Value
Vector length             8
Reused data set size      -
I/O IRAMs                 2
ALU
BREG                      0
FREG                      7
Data flow graph width     3
Data flow graph height    9
Configuration cycles      8+8=16
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f 



Ops                                  Number
LD/ST (2 cycles)                     10
ADDRCOMP (1 cycle)
ADD/SUB (1 cycle)                    16
MUL (2 cycles)                       17
SHIFT (1 cycle)
Cycles per iteration                 70
Cycles needed for the loop (2-way)   (70*8)/2=280



5.2.3 A More Parallel Solution 

The solution we presented does not expose a lot of parallelism in the loop. We can try to explicitly parallelize the loop before we generate the data flow graph. Of course, exposing more parallelism means more pressure on the memory hierarchy.

In the data dependence graph presented above, the only loop-carried dependence is the dependence on S'. We apply node splitting to get a more suitable data dependence graph. We obtain then:

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++) {
        tmp = c[j] * x[i+7-j];
        y[i] += tmp;
    }
}

Then scalar expansion is performed on tmp to remove the anti-dependence caused by it, and we have the following code:

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++) {
        tmp[j] = c[j] * x[i+7-j];
        y[i] += tmp[j];
    }
}



The parameter table is the following: 
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j Configuration cycles 



Then we apply loop distribution to get a vectorizable and a not vectorizable loop:

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i+7-j];
    for (j = 0; j < 8; j++)
        y[i] += tmp[j];
}



The following table corresponds to the two inner loops, in order to be compared with the preceding table.



Parameter                 Value
Vector length             249
Reused data set size
I/O IRAMs
ALU
BREG
FREG
Data flow graph width
Data flow graph height
Configuration cycles



The first loop does not need to be strip-mined, as its vector length fits the resources of the PACT XPP. The case of the second loop is trivial; it does not need to be strip-mined either. The second loop is a reduction: it computes the sum of a vector. This is easily found by the reduction recognition optimization, and we obtain the following code.
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for (i = 0; i < 249; i++) / 
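    y[i] = 0;

    for (j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i+7-j];

    /* load the partial sums from memory using a shorter vector length */
    for (j = 0; j < 4; j++)
        aux[j] = tmp[2*j] + tmp[2*j+1];

    /* accumulate the short vector */
    for (j = 0; j < 2; j++)
        aux[2*j] = aux[2*j] + aux[2*j+1];

    /* sequence of scalar instructions to add up the partial sums */
    y[i] = aux[0] + aux[2];
}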
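The reduction can be checked in isolation for one output value. The sketch below compares the sequential accumulation with the tree-shaped summation over tmp and aux; the vector contents and function names are assumptions made for the example.

```c
#include <assert.h>

/* Sequential accumulation of one FIR output value (8 taps). */
static int dot8_sequential(const int *c, const int *x) {
    int y = 0;
    for (int j = 0; j < 8; j++)
        y += c[j] * x[7 - j];
    return y;
}

/* The same value computed as a vector product followed by a tree reduction. */
static int dot8_tree(const int *c, const int *x) {
    int tmp[8], aux[4];
    for (int j = 0; j < 8; j++)          /* vectorizable multiply loop */
        tmp[j] = c[j] * x[7 - j];
    for (int j = 0; j < 4; j++)          /* first reduction level */
        aux[j] = tmp[2*j] + tmp[2*j+1];
    for (int j = 0; j < 2; j++)          /* second reduction level */
        aux[2*j] = aux[2*j] + aux[2*j+1];
    return aux[0] + aux[2];              /* final scalar addition */
}

/* Returns 1 if both computations agree on a fixed test vector. */
static int reduction_agrees(void) {
    int c[8] = {1, -2, 3, -4, 5, -6, 7, -8};
    int x[8] = {3, 1, 4, 1, 5, 9, 2, 6};
    return dot8_tree(c, x) == dot8_sequential(c, x);
}
```

The tree has three addition levels, which together with the multiplication level gives a data flow graph of height 4.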



Like above we give only one table for all innermost loops and the last instruction computing y[i].

Parameter



Vector length 



Reused data set size 



I/OIRAMs 



ALU 



BREG 



FREG- 



Data flow graph width 



Data flow graph height 



Configuration cycles 



Value 



249 



12 



l*8+l*4+j*i=i3 



S^J2^*js* «- — of ^ w „ ^ , ess ^ ^ 



for (i = 0; i < 961; i++)
{
    tmp[0] = c[0] * x[i+7];
    tmp[1] = c[1] * x[i+6];
    tmp[2] = c[2] * x[i+5];
    tmp[3] = c[3] * x[i+4];
    tmp[4] = c[4] * x[i+3];
    tmp[5] = c[5] * x[i+2];
    tmp[6] = c[6] * x[i+1];
    tmp[7] = c[7] * x[i];

    aux[0] = tmp[0] + tmp[1];
    aux[1] = tmp[2] + tmp[3];
    aux[2] = tmp[4] + tmp[5];
    aux[3] = tmp[6] + tmp[7];

    aux[0] = aux[0] + aux[1];
    aux[2] = aux[2] + aux[3];

    y[i] = aux[0] + aux[2];
}
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We obtain then the following dataflow graph representing the inner loop. 




constant for each ALU. 8t4 due BdSa^iS, o?,h * ** CMfncie Ms are handled like 

reside in the cache. And^ tte nwi of If ^ **" a * lme *" *• *«• •a*'*' 

efflcien.iy. the configure*,? ctn ^JL^i^^^/^ 
.ready m the IRAMs. The parameter table is then mE. ™ 1 ' h< "" : TOtln S the data to be 



Parameter                 Value
Vector length             249
Reused data set size      -
I/O IRAMs                 16
ALU                       15
BREG                      0
FREG                      0
Data flow graph width     8
Data flow graph height    4
Configuration cycles      4+96]



Variant with Larger Bounds 

To make things a bit more interesting, we set the values of N and M to 1024 and 64.

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        y[i] += c[j] * x[i+63-j];
}



The data dependence graph is the same as above. We then apply node splitting to get a more convenient data dependence graph.
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for U - 0; i < 961; { 

    y[i] = 0;
    for (j = 0; j < 64; j++) {
        tmp = c[j] * x[i+63-j];
        y[i] += tmp;
    }
}

After scalar expansion: 

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++) {
        tmp[j] = c[j] * x[i+63-j];
        y[i] += tmp[j];
    }
}

After loop distribution: 

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        tmp[j] = c[j] * x[i+63-j];
    for (j = 0; j < 64; j++)
        y[i] += tmp[j];
}



of options that depends upon 
If we keep the code as such, 64 multiplications would have to be performed in parallel. Hence we perform strip-mining on the 2 loops. We can only access 16 data at a time, so the factor is 64 * 2/16 = 8 for the 2 loops (as we always have in mind that we want to execute both at the same time on the PACT XPP).

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            y[i] += tmp[8*jj+j];
}

And then loop fusion on the jj loops is performed.

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++) {
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
        for (j = 0; j < 8; j++)
            y[i] += tmp[8*jj+j];
    }
}

Now we apply reduction recognition on the second innermost loop. 
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for (i = 0/ i < 961; ( 
    tmp = 0;

    for (jj = 0; jj < 8; jj++) {

        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];

        /* load the partial sums from memory using a shorter vector length */
        for (j = 0; j < 4; j++)
            aux[j] = tmp[8*jj+2*j] + tmp[8*jj+2*j+1];

        /* accumulate the short vector */
        for (j = 0; j < 2; j++)
            aux[2*j] = aux[2*j] + aux[2*j+1];
    }
}

And then loop unrolling. 

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++) {
        tmp[8*jj]   = c[8*jj]   * x[i+63-8*jj];
        tmp[8*jj+1] = c[8*jj+1] * x[i+62-8*jj];
        tmp[8*jj+2] = c[8*jj+2] * x[i+61-8*jj];
        tmp[8*jj+3] = c[8*jj+3] * x[i+60-8*jj];
        tmp[8*jj+4] = c[8*jj+4] * x[i+59-8*jj];
        tmp[8*jj+5] = c[8*jj+5] * x[i+58-8*jj];
        tmp[8*jj+6] = c[8*jj+6] * x[i+57-8*jj];
        tmp[8*jj+7] = c[8*jj+7] * x[i+56-8*jj];

        aux[0] = tmp[8*jj]   + tmp[8*jj+1];
        aux[1] = tmp[8*jj+2] + tmp[8*jj+3];
        aux[2] = tmp[8*jj+4] + tmp[8*jj+5];
        aux[3] = tmp[8*jj+6] + tmp[8*jj+7];

        aux[0] = aux[0] + aux[1];
        aux[2] = aux[2] + aux[3];

        /* accumulate the partial sum of this strip */
        y[i] += aux[0] + aux[2];
    }
}
We implement the innermost loop on the PACT XPP directly with a counter. The IRAMs are used in FIFO mode, and filled according to the addresses of the arrays in the loop. IRAM0, IRAM2, IRAM4, IRAM6 and IRAM8 contain array c. IRAM1, IRAM3, IRAM5 and IRAM7 contain array x. Array c contains 64 elements, that is, each IRAM contains 8 elements. Array x contains 1024 elements, that is, 128 elements for each IRAM. Array y is directly written to memory, as it is a global array and its address is constant. This constant is used to initialize the address counter of the configuration. The final parameter table is the following:
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Parameter 



                          Value
Vector length             8
Reused data set size
I/O IRAMs                 16
ALU                       15
BREG                      0
FREG
Data flow graph width
Data flow graph height    4
Configuration cycles      4+8=12



the previous one. As the 

before the configuration can begin the wmDuteS,^^ V ^ a Ioc ^transfers to achieve 

compiler when choosing the code generation strategy. This also means that the first solution is the solution that will be chosen by the compiler.



5.2.4 Other Variant 



Source Code 



for (i = 0; i < N-M+1; i++) {
    tmp = 0;
    for (j = 0; j < M; j++)
        tmp += c[j] * x[i+M-j-1];
^ = tmp; J J 



We obtain in fact the same code as the first version of the FIR filter, as shown below.

for (i = 0; i < N-M+1; i++) {
    tmp[i] = 0;
    for (j = 0; j < M; j++)
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5.3 Matrix Multiplication 

5.3.1 Original Code 

Source code: 

#include <stdio.h>

#define L 10
#define M 15
#define N 20

int A[L][M];
int B[M][N];
int R[L][N];

main() {
  int i, j, k, tmp, aux;

  /* input A (L*M values) */
  for(i=0; i<L; i++)
    for(j=0; j<M; j++)
      scanf("%d", &A[i][j]);

  /* input B (M*N values) */
  for(i=0; i<M; i++)
    for(j=0; j<N; j++)
      scanf("%d", &B[i][j]);

  /* multiply */
  for(i=0; i<L; i++)
    for(j=0; j<N; j++) {
      aux = 0;
      for(k=0; k<M; k++)
        aux += A[i][k] * B[k][j];
      R[i][j] = aux;
    }

  /* write data stream */
  for(i=0; i<L; i++)
    for(j=0; j<N; j++)
      printf("%d\n", R[i][j]);
}

5.3.2 Preliminary Transformations

Since no inline-able function calls are present, no interprocedural code movement is done. 

Of the four loop nests, the one with the "multiply" comment is the only candidate for running on the XPP. All others have function calls in the loop body and are therefore discarded as candidates very early in the compiler.

Dependency Analysis 

for(i=0; i<L; i++)
  for(j=0; j<N; j++) {
S1  aux = 0;

    for(k=0; k<M; k++)
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« ' Br -w.f UX += A[il£k] * B CJc] tj]; 




(S3) 



Figure 59 Data dependency graph for matrix multiplication

The data dependency graph shows no dependencies that prevent pipeline vectorization. The loop-carried true dependence from S2 to itself can be handled by a feedback of aux.

Reverse Loop-Invariant Code Motion 

To get a perfect loop nest, we move S1 and S3 inside the k-loop, protected by appropriate guards:

for(i=0; i<L; i++)
  for(j=0; j<N; j++)
    for(k=0; k<M; k++) {
      if (k == 0) aux = 0;
      aux += A[i][k] * B[k][j];
      if (k == M-1) R[i][j] = aux;
    }

Scalar Expansion 

for(i=0; i<L; i++)
  for(j=0; j<N; j++)
    for(k=0; k<M; k++) {
      if (k == 0) aux[j] = 0;
      aux[j] += A[i][k] * B[k][j];
      if (k == M-1) R[i][j] = aux[j];
    }

Loop Interchange for Cache Reuse 

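Visualizing the main loop shows the iteration spaces for the array accesses (Figure 60). Since C arrays are placed in row-major order, the cache lines are placed in the array rows. At first sight there seems no need for optimization, because the algorithm requires at least one array access to stride over a column. Nevertheless this assumption misses the fact that the access rate is of interest, too. Closer examination shows that array R is accessed in every j iteration, while B is accessed every k-iteration, always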
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producing a cache miss 2 . This leaves a possibility for loop interchange to improve cache access as 
proposed by Kennedy and Allen in [7]. H • cess as 




Figure 60 The visualized array access sequences

- First the cost of each reference 3 in the innermost loop body is calculated to

  • 1, if the reference does not depend on the loop induction variable of the (current) innermost loop

  • the loop count, if the reference depends on the loop induction variable and strides over a noncontiguous area with respect to the cache layout

  • N*s/b, if the reference depends on the loop induction variable and strides over a contiguous area, where N is the loop count, s is the step size and b is the cache line size, respectively.

- Second each reference cost is weighted with a factor for each other loop, which is

  • 1, if the reference does not depend on the loop index

  • the loop count, if the reference depends on the loop index.

- Third the overall loop nest cost is calculated by summing the costs of all references.
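The three steps can be sketched for the matrix multiplication as follows. The cache line size b = 4 (in array elements) is an assumption made only for this example, as are the enum, the function names and the use of floating point; the loop counts L = 10, M = 15 and N = 20 are taken from the source code above.

```c
#include <assert.h>

/* How a reference behaves with respect to the candidate innermost loop. */
typedef enum { REF_INVARIANT, REF_STRIDED, REF_CONTIGUOUS } ref_kind;

/* Step 1: cost of one reference in the innermost loop
   (count = loop count, step = step size, b = cache line size). */
static double innermost_cost(ref_kind kind, double count, double step, double b) {
    switch (kind) {
    case REF_INVARIANT: return 1.0;
    case REF_STRIDED:   return count;
    default:            return count * step / b;
    }
}

/* Steps 2 and 3 for R[i][j], A[i][k], B[k][j] with loop counts
   i: L, j: N, k: M. 'innermost' is one of 'i', 'j', 'k'. */
static double nest_cost(char innermost, double L, double M, double N, double b) {
    double r, a, bb;
    if (innermost == 'k') {
        r  = innermost_cost(REF_INVARIANT, M, 1, b) * L * N;  /* R depends on i, j */
        a  = innermost_cost(REF_CONTIGUOUS, M, 1, b) * L;     /* A depends on i    */
        bb = innermost_cost(REF_STRIDED, M, 1, b) * N;        /* B depends on j    */
    } else if (innermost == 'i') {
        r  = innermost_cost(REF_STRIDED, L, 1, b) * N;        /* R depends on j    */
        a  = innermost_cost(REF_STRIDED, L, 1, b) * M;        /* A depends on k    */
        bb = innermost_cost(REF_INVARIANT, L, 1, b) * M * N;  /* B depends on k, j */
    } else { /* 'j' */
        r  = innermost_cost(REF_CONTIGUOUS, N, 1, b) * L;     /* R depends on i    */
        a  = innermost_cost(REF_INVARIANT, N, 1, b) * L * M;  /* A depends on i, k */
        bb = innermost_cost(REF_CONTIGUOUS, N, 1, b) * M;     /* B depends on k    */
    }
    return r + a + bb;
}
```

Under these assumptions the j-loop has the lowest cost, which agrees with the choice of j as the innermost loop.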



3 We neglect "aux" in this observation since we do not expect it to be written to or read from memory (no defs or uses outside the loop nest).
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tnnermost loop 


                 R[i][j]      A[i][k]      B[k][j]      Total
k                1*L*N        (M/b)*L      M*N          L*N + (M/b)*L + M*N
i                L*N          L*M          1*M*N        L*N + L*M + M*N
j                (N/b)*L      1*L*M        (N/b)*M      (N/b)*L + L*M + (N/b)*M

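The table shows the values for the matrix multiplication. Since the j term is the smallest (assuming b > 1), the j-loop is chosen to be the innermost loop. The next outer loop then is k, and the outermost is i. Thus the resulting code after loop interchange is: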
for(i=0; i<L; i++)
  for(k=0; k<M; k++)
    for(j=0; j<N; j++) {
      if (k == 0) aux[j] = 0;
      aux[j] += A[i][k] * B[k][j];
      if (k == M-1) R[i][j] = aux[j];
    }



The loop interchange mainly optimizes the cache-hit rate, thus improving the overall performance.
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Unroll and Jam 

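After improving the cache access behavior, the possibility for reduction recognition has been destroyed. This is a typical example for transformations where one excludes the other. Nevertheless we obtain more parallelism by doing unroll-and-jam. Therefore we unroll the outer loop partially with the unroll factor. This factor is mainly chosen by the minimum of two calculations:

• # available IRAMs / # used IRAMs in the inner loop body

• # available ALU resources / # used ALU resources in the inner loop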

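In this example the accesses to "A" and "B" depend on k (the loop which will be unrolled). Therefore they must be considered in the calculation. The accesses to "aux" and "R" do not depend on k. Thus they can be subtracted from the available IRAMs. The constraint generated by the IRAMs therefore dominates by far.

Having chosen the unroll factor, we must trim the loop count to be a multiple of that factor. Since the k loop has a loop count of 15, we peel off the first iteration and unroll the remaining loop.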



for(i=0; i<L; i++) {
  for(k=0; k<1; k++) {
    for(j=0; j<N; j++) {
      if (k==0) aux[j] = 0;
      aux[j] += A[i][k] * B[k][j];
      if (k==M-1) R[i][j] = aux[j];
    }
  }
  for(k=1; k<M; k+=7) {
    for(j=0; j<N; j++) {
      if (k==0) aux[j] = 0;
      aux[j] += A[i][k] * B[k][j];
      if (k==M-1) R[i][j] = aux[j];
    }
    for(j=0; j<N; j++) {
      if (k+1==0) aux[j] = 0;
      aux[j] += A[i][k+1] * B[k+1][j];
      if (k+1==M-1) R[i][j] = aux[j];
    }
    for(j=0; j<N; j++) {
      if (k+2==0) aux[j] = 0;
      aux[j] += A[i][k+2] * B[k+2][j];
      if (k+2==M-1) R[i][j] = aux[j];
    }
    for(j=0; j<N; j++) {
      if (k+3==0) aux[j] = 0;
      aux[j] += A[i][k+3] * B[k+3][j];
      if (k+3==M-1) R[i][j] = aux[j];
    }
    for(j=0; j<N; j++) {
      if (k+4==0) aux[j] = 0;
      aux[j] += A[i][k+4] * B[k+4][j];
      if (k+4==M-1) R[i][j] = aux[j];
    }
    for(j=0; j<N; j++) {
      if (k+5==0) aux[j] = 0;
      aux[j] += A[i][k+5] * B[k+5][j];
      if (k+5==M-1) R[i][j] = aux[j];
    }
    for(j=0; j<N; j++) {
      if (k+6==0) aux[j] = 0;
      aux[j] += A[i][k+6] * B[k+6][j];
      if (k+6==M-1) R[i][j] = aux[j];
    }
  }
}

Note that the BREG-PAEs also have an adder, which increases the unroll factor. A production compiler has to account for this.

Dead code elimination can get rid of some of these duplicates. Thus the code is shortened to:

for(i=0; i<L; i++) {
  for(k=0; k<1; k++) {
    for(j=0; j<N; j++) {
      if (k==0) aux[j] = 0;
      aux[j] += A[i][k] * B[k][j];
    }
  }
  for(k=1; k<M; k+=7) {
    for(j=0; j<N; j++) {
      aux[j] += A[i][k] * B[k][j];
    }
    for(j=0; j<N; j++) {
      aux[j] += A[i][k+1] * B[k+1][j];
    }
    for(j=0; j<N; j++) {
      aux[j] += A[i][k+2] * B[k+2][j];
    }
    for(j=0; j<N; j++) {
      aux[j] += A[i][k+3] * B[k+3][j];
    }
    for(j=0; j<N; j++) {
      aux[j] += A[i][k+4] * B[k+4][j];
    }
    for(j=0; j<N; j++) {
      aux[j] += A[i][k+5] * B[k+5][j];
    }
    for(j=0; j<N; j++) {
      aux[j] += A[i][k+6] * B[k+6][j];
      if (k+6==M-1) R[i][j] = aux[j];
    }
  }
}

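Before we jam the inner loops, we have to account for the fact that the first iteration of the k loop was peeled off, which would produce a configuration of its own. Since we calculated the unroll-and-jam factor to fit into one configuration, this side effect has to be prevented. Because it should be no problem to run the k loop with variable step sizes, we fuse the k loops again, adjust the step size and guard the statements. This yields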

for(i=0; i<L; i++) {
  for(k=0; k<M; k+= k<1 ? 1 : 7) {
    for(j=0; j<N; j++) {
      if (k==0) aux[j] = 0;
      if (k==0) aux[j] += A[i][k] * B[k][j];
    }
    for(j=0; j<N; j++) {
      if (k>0) aux[j] += A[i][k] * B[k][j];
    }
    for(j=0; j<N; j++) {
      if (k>0) aux[j] += A[i][k+1] * B[k+1][j];
    }
    for(j=0; j<N; j++) {
      if (k>0) aux[j] += A[i][k+2] * B[k+2][j];
    }
    for(j=0; j<N; j++) {
      if (k>0) aux[j] += A[i][k+3] * B[k+3][j];
    }
    for(j=0; j<N; j++) {
      if (k>0) aux[j] += A[i][k+4] * B[k+4][j];
    }
    for(j=0; j<N; j++) {
      if (k>0) aux[j] += A[i][k+5] * B[k+5][j];
    }
    for(j=0; j<N; j++) {
      if (k>0) aux[j] += A[i][k+6] * B[k+6][j];
      if (k+6==M-1) R[i][j] = aux[j];
    }
  }
}



Now we can jam the inner loops and finally obtain 

for(i=0; i<L; i++) {
  for(k=0; k<M; k+= k<1 ? 1 : 7) {
    for(j=0; j<N; j++) {
      if (k==0) aux[j] = 0;
      if (k==0) aux[j] += A[i][k] * B[k][j];
      if (k>0) {
        aux[j] += A[i][k] * B[k][j];
        aux[j] += A[i][k+1] * B[k+1][j];
        aux[j] += A[i][k+2] * B[k+2][j];
        aux[j] += A[i][k+3] * B[k+3][j];
        aux[j] += A[i][k+4] * B[k+4][j];
        aux[j] += A[i][k+5] * B[k+5][j];
        aux[j] += A[i][k+6] * B[k+6][j];
        if (k+6==M-1) R[i][j] = aux[j];
      }
    }
  }
}
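To convince oneself that the jammed loop nest with the variable step size still computes the matrix product, it can be compared against the naive triple loop. In the sketch below the macros LDIM, MDIM and NDIM stand for L, M and N of the source code; the function names and test data are assumptions for this illustration.

```c
#include <assert.h>

#define LDIM 10   /* L in the text */
#define MDIM 15   /* M in the text */
#define NDIM 20   /* N in the text */

/* Reference: straightforward matrix multiplication. */
static void matmul_naive(int A[LDIM][MDIM], int B[MDIM][NDIM], int R[LDIM][NDIM]) {
    for (int i = 0; i < LDIM; i++)
        for (int j = 0; j < NDIM; j++) {
            int aux = 0;
            for (int k = 0; k < MDIM; k++)
                aux += A[i][k] * B[k][j];
            R[i][j] = aux;
        }
}

/* Interchanged, peeled, unrolled by 7 and jammed version.
   k takes the values 0, 1 and 8. */
static void matmul_jammed(int A[LDIM][MDIM], int B[MDIM][NDIM], int R[LDIM][NDIM]) {
    int aux[NDIM];
    for (int i = 0; i < LDIM; i++)
        for (int k = 0; k < MDIM; k += k < 1 ? 1 : 7)
            for (int j = 0; j < NDIM; j++) {
                if (k == 0) aux[j] = 0;
                if (k == 0) aux[j] += A[i][k] * B[k][j];
                if (k > 0) {
                    aux[j] += A[i][k]   * B[k][j];
                    aux[j] += A[i][k+1] * B[k+1][j];
                    aux[j] += A[i][k+2] * B[k+2][j];
                    aux[j] += A[i][k+3] * B[k+3][j];
                    aux[j] += A[i][k+4] * B[k+4][j];
                    aux[j] += A[i][k+5] * B[k+5][j];
                    aux[j] += A[i][k+6] * B[k+6][j];
                    if (k + 6 == MDIM - 1) R[i][j] = aux[j];
                }
            }
}

/* Returns 1 if both versions produce the same result matrix. */
static int matmul_versions_agree(void) {
    static int A[LDIM][MDIM], B[MDIM][NDIM], R1[LDIM][NDIM], R2[LDIM][NDIM];
    for (int i = 0; i < LDIM; i++)
        for (int k = 0; k < MDIM; k++) A[i][k] = (i * 7 + k * 3) % 11 - 5;
    for (int k = 0; k < MDIM; k++)
        for (int j = 0; j < NDIM; j++) B[k][j] = (k * 5 + j * 2) % 13 - 6;
    matmul_naive(A, B, R1);
    matmul_jammed(A, B, R2);
    for (int i = 0; i < LDIM; i++)
        for (int j = 0; j < NDIM; j++)
            if (R1[i][j] != R2[i][j]) return 0;
    return 1;
}
```

The scheme relies on the loop count of 15 being one more than a multiple of the unroll factor 7; for other sizes the peeling would have to be adapted.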

5.3.3 XPP Code Generation 

The innermost loop can be synthesized in a configuration which uses 14 IRAMs for the input data, one IRAM to temporarily store aux and one IRAM for the output array R. Furthermore it is necessary to pass the value of k to the XPP to direct the dataflow. Figure 62 shows the dataflow graph of the synthesized configuration.
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The following code shows the pseudo code executed on the RISC processor. 
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XPPPreload(config) 
for(i=0; i<L; i++) {
  XPPPreload(0, &A[i][0], M)
  XPPPreload(1, &A[i][0], M)
  XPPPreload(2, &A[i][0], M)
  XPPPreload(3, &A[i][0], M)
  XPPPreload(4, &A[i][0], M)
  XPPPreload(5, &A[i][0], M)
  XPPPreload(6, &A[i][0], M)
  XPPPreloadClean(15, &R[i][0], M)
  for(k=0; k<M; k+= k<1 ? 1 : 7) {
    XPPPreload(7, &B[k][0], N)
    XPPPreload(8, &B[k+1][0], N)
    XPPPreload(9, &B[k+2][0], N)
    XPPPreload(10, &B[k+3][0], N)
    XPPPreload(11, &B[k+4][0], N)
    XPPPreload(12, &B[k+5][0], N)
    XPPPreload(13, &B[k+6][0], N)
    XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3),
               IRAM(4), IRAM(5), IRAM(6), IRAM(7),
               IRAM(8), IRAM(9), IRAM(10), IRAM(11),
               IRAM(12), IRAM(13), IRAM(15), k)
  }
}

of 200-300 peren, eomU^ a ^te^S co^ ^ VllttCS P™** ^v^S 



Parameter                 Value
Vector length             20
Reused data set size      20
I/O IRAMs                 14I + 1O + 1 internal
ALU                       20
BREG                      26 (8 defined + 18 route)
FREG                      28 (4 defined + 24 route)
Data flow graph width     14
Data flow graph height
Configuration cycles      configuration                        2633
                          preloads   10*3*7*5                  1050
                                     10*7*15                   1050
                          cycles     ((k=0) 112 + (k=1) 100 +
                                      (k=7) 100) * 10        = 3120
                          sum                                  7853
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5.4.1 Original Code 

Source Code: 

/* C-language butterfly */
#define BFLY(i) {\
unsigned char metric,m0,m1,decision;\
    metric = ((Branchtab29_1[i] ^ sym1) +\
             (Branchtab29_2[i] ^ sym2) + 1)/2;\
    m0 = vp->old_metrics[i] + metric;\
    m1 = vp->old_metrics[i+128] + (15 - metric);\
    decision = (m0-m1) >= 0;\
    vp->new_metrics[2*i] = decision ? m1 : m0;\
    vp->dp->w[i/16] |= decision << ((2*i)&31);\
    m0 -= (metric+metric-15);\
    m1 += (metric+metric-15);\
    decision = (m0-m1) >= 0;\
    vp->new_metrics[2*i+1] = decision ? m1 : m0;\
    vp->dp->w[i/16] |= decision << ((2*i+1)&31);\
}

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
    int i;
    struct v29 *vp = p;
    unsigned char *tmp;
    int normalize = 0;

    for(i=0;i<8;i++)
        vp->dp->w[i] = 0;

    for(i=0;i<128;i++)
        BFLY(i);

    /* Renormalize metrics */
    if (vp->new_metrics[0] > 150) {
        int i;
        unsigned char minmetric = 255;

        for(i=0;i<64;i++)
            if (vp->new_metrics[i] < minmetric)
                minmetric = vp->new_metrics[i];
        for(i=0;i<64;i++)
            vp->new_metrics[i] -= minmetric;
        normalize = minmetric;
    }

    vp->dp++;

    tmp = vp->old_metrics;
    vp->old_metrics = vp->new_metrics;
    vp->new_metrics = tmp;

    return normalize;
}
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5.4.2 Interprocedural Optimizations and Scalar Transformations 

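Since no inline-able function calls are present, no interprocedural code movement is done.

After expression simplification, strength reduction, SSA renaming, copy coalescing and idiom recognition, the code looks like the following (statements reordered for convenience). Note that idiom recognition will find the combination of min() and the comparison result used for decision and _decision. However, the resulting computation cannot be expressed in C, so we describe it as a comment: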

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
  int i;
  struct v29 *vp = p;
  unsigned char *tmp;
  int normalize = 0;

  char *_vpdpw_ = vp->dp->w;
  for(i=0;i<8;i++)
    *_vpdpw_++ = 0;

  char *_bt29_1 = Branchtab29_1;
  char *_bt29_2 = Branchtab29_2;
  char *_vpom0 = vp->old_metrics;
  char *_vpom128 = vp->old_metrics+128;
  char *_vpnm = vp->new_metrics;
  char *_vpdpw = vp->dp->w;

  for(i=0;i<128;i++) {
    unsigned char metric,_tmp,m0,m1,_m0,_m1,decision,_decision;
    metric = ((*_bt29_1++ ^ sym1) +
              (*_bt29_2++ ^ sym2) + 1)/2;
    _tmp = (metric+metric-15);
    m0 = *_vpom0++ + metric;
    m1 = *_vpom128++ + (15 - metric);
    _m0 = m0 - _tmp;
    _m1 = m1 + _tmp;
    // decision = m0 >= m1;
    // _decision = _m0 >= _m1;
    *_vpnm++ = min(m0,m1);    // = decision ? m1 : m0
    *_vpnm++ = min(_m0,_m1);  // = _decision ? _m1 : _m0
    _vpdpw[i >> 4] |= (m0 >= m1)   /* decision */  << ((2*i) & 31)
                    | (_m0 >= _m1) /* _decision */ << ((2*i+1) & 31);
  }

  /* Renormalize metrics */
  if (vp->new_metrics[0] > 150) {
    int i;
    unsigned char minmetric = 255;

    char *_vpnm = vp->new_metrics;
    for(i=0;i<64;i++)
      minmetric = min(minmetric, *_vpnm++);

    _vpnm = vp->new_metrics;
    for(i=0;i<64;i++)
      *_vpnm++ -= minmetric;
    normalize = minmetric;
  }
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vp->dp++; 

tmp = vp->old_metrics; 
vp->old_metrics = vp->new metrics; 
vp->new_metricg = tmp/ 

return normalize; 
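
The min() idiom used above can be checked in isolation. The following is a minimal sketch, not part of the compiler: the helper names select_by_decision, select_by_min and selection_forms_agree are hypothetical, and the code only assumes 8-bit unsigned path metrics as in the listing.

```c
#include <stdint.h>

/* Selection as written in the BFLY macro: the decision bit is set when
 * path 0 is not better than path 1, and it picks the surviving metric.
 * The operands are promoted to int, so the subtraction cannot wrap. */
static uint8_t select_by_decision(uint8_t m0, uint8_t m1)
{
    int decision = (m0 - m1) >= 0;
    return decision ? m1 : m0;
}

/* Form after idiom recognition: a plain minimum of the two metrics. */
static uint8_t select_by_min(uint8_t m0, uint8_t m1)
{
    return m0 < m1 ? m0 : m1;
}

/* Returns 1 if both forms agree for every pair of 8-bit metrics. */
int selection_forms_agree(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (select_by_decision((uint8_t)a, (uint8_t)b) !=
                select_by_min((uint8_t)a, (uint8_t)b))
                return 0;
    return 1;
}
```

Calling selection_forms_agree() walks all 65536 metric pairs, which is why the survivor selection can be mapped to a single min operation while the decision bit is computed separately by a comparison.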



5.4.3 Initialization 



The first loop (setting vp->dp->w[0..7] to zero) is most efficiently executed on the RISC.

5.4.4 Butterfly Loop

The second loop (with the BFLY() macro expanded) is of interest for the XPP compiler and needs further examination:

    char *iram0 = Branchtab29_1;            // XPPPreload(0, Branchtab29_1, 128/4);
    char *iram2 = Branchtab29_2;            // XPPPreload(2, Branchtab29_2, 128/4);
    char *iram4 = vp->old_metrics;          // XPPPreload(4, vp->old_metrics, 128/4);
    char *iram5 = vp->old_metrics+128;      // XPPPreload(5, vp->old_metrics+128, 128/4);
    short *iram6 = vp->new_metrics;         // XPPPreload(6, vp->new_metrics, 128/2);
    unsigned long *iram7 = vp->dp->w;       // XPPPreload(7, vp->dp->w, 8);
    // sym1 & sym2 are in IRAM 1 & 3

    for (i=0;i<128;i++) {
        unsigned char metric, _tmp, m0, m1, _m0, _m1;

        metric = ((*iram0++ ^ sym1) +
                  (*iram2++ ^ sym2) + 1)/2;
        _tmp = (metric << 1) - 15;
        m0 = *iram4++ + metric;
        m1 = *iram5++ + (15 - metric);
        _m0 = m0 - _tmp;
        _m1 = m1 + _tmp;

        *iram6++ = (min(m0, m1) << 8) | min(_m0, _m1);
        iram7[i >> 4] |= ( m0 >=  m1) << ((2*i) & 31)
                       | (_m0 >= _m1) << ((2*i+1) & 31);
    }
[Figure: data flow graph of the butterfly loop. Inputs: Btab29_1 (iram0), sym1 (iram1), Btab29_2 (iram2), sym2 (iram3), oldmetrics (iram4), oldmetrics+128 (iram5). Intermediate values m0, _m1 and the counter cnt (2i). Outputs: newmetrics (iram6), vp->dp->w (iram7).]
Parameter                 Value
Vector length             128
Reused data set size
I/O IRAMs                 6I+2O
ALU                       25
BREG                      few
FREG                      few
Data flow graph width
Data flow graph height    11
Configuration cycles      11+128



Loop tiling with a tile size of 16 gives the redundant load/store elimination a chance to read the value once and accumulate the bits in a temporary, writing the value to the IRAM at the end of this inner loop. Loop fusion with the initialization loop then allows propagation of the zero values set in the first loop to the reads of vp->dp->w[i] (IRAM7), eliminating the first loop altogether. Loop tiling with a tile size of 16 also eliminates the & 31 expressions for the shift values: since the new inner loop only runs from 0 to 16, the value range analysis now finds that the & 31 expression is not limiting the value range any further.

All remaining input IRAMs are character (8-bit) based, so split networks are required to split the 32-bit stream into four 8-bit streams, which takes 3 shifts and 3 merges for every character IRAM. The merges could be eliminated by unrolling the loop body. However, unrolling is limited by ALU availability as well as by the fact that IRAM6 is already 16-bit based: unrolling once requires a shift by 16 and an OR to write 32 bits in every cycle, and unrolling further cannot increase pipeline throughput any more. So the body is only unrolled once, eliminating one layer of merges. This yields two separate pipelines that each handle two 8-bit slices of the 32-bit value from the IRAM.

The modified code now looks like this (unrolling and splitting omitted for simplicity):

    char *iram0 = Branchtab29_1;            // XPPPreload(0, Branchtab29_1, 128/4);
    char *iram2 = Branchtab29_2;            // XPPPreload(2, Branchtab29_2, 128/4);
    char *iram4 = vp->old_metrics;          // XPPPreload(4, vp->old_metrics, 128/4);
    char *iram5 = vp->old_metrics+128;      // XPPPreload(5, vp->old_metrics+128, 128/4);
    short *iram6 = vp->new_metrics;         // XPPPreload(6, vp->new_metrics, 128/2);
    unsigned long *iram7 = vp->dp->w;       // XPPPreload(7, vp->dp->w, 8);
    // sym1 & sym2 are in IRAM 1 & 3

    for (_i=0;_i<8;_i++) {
        rlse = 0;
        for (i2=0;i2<32;i2+=2) {
            unsigned char metric, _tmp, m0, m1, _m0, _m1;

            metric = ((*iram0++ ^ sym1) +
                      (*iram2++ ^ sym2) + 1)/2;
            _tmp = (metric << 1) - 15;
            m0 = *iram4++ + metric;
            m1 = *iram5++ + (15 - metric);
            _m0 = m0 - _tmp;
            _m1 = m1 + _tmp;

            *iram6++ = (min(m0, m1) << 8) | min(_m0, _m1);
            rlse = rlse | ( m0 >=  m1) << i2
                        | (_m0 >= _m1) << (i2+1);
        }
        *iram7++ = rlse;
    }
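
The effect of tiling on the decision-bit packing can be sketched on its own. This is an illustrative model only: pack_direct, pack_tiled and packings_agree are hypothetical names, the decision bits are supplied as plain arrays instead of being computed from metrics, and uint32_t stands in for the 32-bit IRAM word.

```c
#include <stdint.h>
#include <string.h>

#define N_BUTTERFLIES 128
#define N_WORDS 8

/* Direct form: every iteration updates a word of w[] in memory and
 * needs the "& 31" mask to keep the shift count inside the word. */
void pack_direct(const uint8_t *d0, const uint8_t *d1, uint32_t *w)
{
    memset(w, 0, N_WORDS * sizeof w[0]);
    for (int i = 0; i < N_BUTTERFLIES; i++) {
        w[i >> 4] |= (uint32_t)(d0[i] & 1) << ((2 * i) & 31);
        w[i >> 4] |= (uint32_t)(d1[i] & 1) << ((2 * i + 1) & 31);
    }
}

/* Tiled form (tile size 16): the bits are accumulated in a temporary
 * and written once per tile. The shift count i2 is already in 0..31,
 * so no mask is needed, and no separate zeroing loop is required. */
void pack_tiled(const uint8_t *d0, const uint8_t *d1, uint32_t *w)
{
    for (int t = 0; t < N_WORDS; t++) {
        uint32_t acc = 0;
        for (int i2 = 0; i2 < 32; i2 += 2) {
            int i = 16 * t + (i2 >> 1);
            acc |= (uint32_t)(d0[i] & 1) << i2;
            acc |= (uint32_t)(d1[i] & 1) << (i2 + 1);
        }
        w[t] = acc;
    }
}

/* Returns 1 if both forms produce the same eight words. */
int packings_agree(void)
{
    uint8_t d0[N_BUTTERFLIES], d1[N_BUTTERFLIES];
    uint32_t a[N_WORDS], b[N_WORDS];
    for (int i = 0; i < N_BUTTERFLIES; i++) {
        d0[i] = (uint8_t)((i * 5 + 1) % 3 == 0);
        d1[i] = (uint8_t)((i ^ (i >> 2)) & 1);
    }
    pack_direct(d0, d1, a);
    pack_tiled(d0, d1, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

The tiled form mirrors the rlse accumulator in the listing: one write per 32-bit word instead of a read-modify-write per butterfly.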



The modified data flow graph (unrolling and splitting omitted for simplicity): 



[Figure: modified data flow graph of the butterfly loop. Inputs: Btab29_1 (iram0), sym1 (iram1), Btab29_2 (iram2), sym2 (iram3). Outputs: newmetrics, and the decision bits in iram7.]

In the splitting network, the bottom-most level merge is omitted for each level of unrolling.




Parameter                 Value
Vector length             128
Reused data set size
I/O IRAMs                 6I+2O
ALU                       2*24+4*3(split)+2(join) = 62
BREG                      few
FREG                      few
Data flow graph width     4
Data flow graph height    11+3(split)
Configuration cycles      14+64



5.4.5 Re-Normalization

Minimum Search

The third loop is a minimum search on a byte array.

    char *iram0 = vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 64/4);
    for (i=0;i<64;i++)
        minmetric = min(minmetric, *iram0++);






Parameter                 Value
Vector length             64
Reused data set size      -
I/O IRAMs                 1+1
ALU                       1
BREG
FREG                      0
Data flow graph width     1
Data flow graph height    1
Configuration cycles      64



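Reduction recognition eliminates the loop-carried dependence on minmetric, so the loop can be unrolled four times to use the full 32-bit IRAM width. Tree balancing redistributes the min() operations to minimize the tree height:

    char *iram0 = vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 16);
    for (i=0;i<16;i++)
        minmetric = min(minmetric, min(min(*iram0++, *iram0++),
                                       min(*iram0++, *iram0++)));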
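
As a small self-contained illustration of why this rewrite is legal, the sketch below compares a sequential minimum search with the four-way unrolled, tree-balanced form. The function names are hypothetical, and plain array indexing is used in place of the IRAM pointers.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t min_u8(uint8_t a, uint8_t b) { return a < b ? a : b; }

/* Sequential reduction: each step depends on the previous result. */
uint8_t min_sequential(const uint8_t *v, size_t n)
{
    uint8_t m = 255;
    for (size_t i = 0; i < n; i++)
        m = min_u8(m, v[i]);
    return m;
}

/* Four bytes (one 32-bit word) per iteration, combined in a balanced
 * tree of height two before touching the accumulator.
 * n is assumed to be a multiple of 4. */
uint8_t min_unrolled4(const uint8_t *v, size_t n)
{
    uint8_t m = 255;
    for (size_t i = 0; i < n; i += 4)
        m = min_u8(m, min_u8(min_u8(v[i], v[i + 1]),
                             min_u8(v[i + 2], v[i + 3])));
    return m;
}
```

Because min is associative and commutative, both functions return the same value for any input; only the dependence chain on the accumulator becomes four times shorter.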



Parameter                 Value
Vector length             16
Reused data set size
I/O IRAMs                 1I+1O
ALU                       4*min
BREG                      3*shln+3*shrn
FREG                      0
Data flow graph width     4
Data flow graph height
Configuration cycles      5+16



Reduction recognition again eliminates the dependence on minmetric, enabling loop tiling with a tile size of 16 IRAMs / 2 IRAMs = 8. Constant propagation and tree rebalancing reduce the dependence height of the final merging expression:

    char *iram0 = vp->new_metrics;          // XPPPreload(0, vp->new_metrics, 2);
    char *iram1 = vp->new_metrics+8;        // XPPPreload(1, vp->new_metrics+8, 2);
    char *iram2 = vp->new_metrics+16;       // XPPPreload(2, vp->new_metrics+16, 2);
    char *iram3 = vp->new_metrics+24;       // XPPPreload(3, vp->new_metrics+24, 2);
    char *iram4 = vp->new_metrics+32;       // XPPPreload(4, vp->new_metrics+32, 2);
    char *iram5 = vp->new_metrics+40;       // XPPPreload(5, vp->new_metrics+40, 2);
    char *iram6 = vp->new_metrics+48;       // XPPPreload(6, vp->new_metrics+48, 2);
    char *iram7 = vp->new_metrics+56;       // XPPPreload(7, vp->new_metrics+56, 2);

    for (_i=0;_i<2;_i++) {
        minmetric0 = min(minmetric0, min(min(*iram0++, *iram0++),
                                         min(*iram0++, *iram0++)));
        minmetric1 = min(minmetric1, min(min(*iram1++, *iram1++),
                                         min(*iram1++, *iram1++)));
        minmetric2 = min(minmetric2, min(min(*iram2++, *iram2++),
                                         min(*iram2++, *iram2++)));
        minmetric3 = min(minmetric3, min(min(*iram3++, *iram3++),
                                         min(*iram3++, *iram3++)));
        minmetric4 = min(minmetric4, min(min(*iram4++, *iram4++),
                                         min(*iram4++, *iram4++)));
        minmetric5 = min(minmetric5, min(min(*iram5++, *iram5++),
                                         min(*iram5++, *iram5++)));
        minmetric6 = min(minmetric6, min(min(*iram6++, *iram6++),
                                         min(*iram6++, *iram6++)));
        minmetric7 = min(minmetric7, min(min(*iram7++, *iram7++),
                                         min(*iram7++, *iram7++)));
    }
    minmetric = min(min(min(minmetric0, minmetric1),
                        min(minmetric2, minmetric3)),
                    min(min(minmetric4, minmetric5),
                        min(minmetric6, minmetric7)));



Parameter                 Value
Vector length             2
Reused data set size
I/O IRAMs                 8I+1O
ALU                       8*4*min = 32
BREG                      8*(3*shln+3*shrn) = 48
FREG                      0
Data flow graph width     8*4 = 32
Data flow graph height    5
Configuration cycles      8+2

Re-Normalization

    char *iram0 = vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 64/4)
    char *iram1 = vp->new_metrics;      // XPPPreloadClean(1, vp->new_metrics, 64/4)
    for (i=0;i<64;i++)
        *iram1++ = *iram0++ - minmetric;
Parameter                 Value
Vector length
Reused data set size
I/O IRAMs                 2I+1O
ALU                       1
BREG                      0
FREG                      0
Data flow graph width     1
Data flow graph height    1
Configuration cycles



There are no loop-carried dependencies. Since the data size is bytes, the inner loop can be unrolled four times without exceeding the IRAM bandwidth.

    char *iram0 = vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 16)
    char *iram1 = vp->new_metrics;      // XPPPreloadClean(1, vp->new_metrics, 16)
    for (i=0;i<16;i++) {
        *iram1++ = *iram0++ - minmetric;
        *iram1++ = *iram0++ - minmetric;
        *iram1++ = *iram0++ - minmetric;
        *iram1++ = *iram0++ - minmetric;
    }

Data flow graph width
Data flow graph height
Configuration cycles

2(split)+4*1(sub)+2(join) = 8



Unrolling is now limited by the BREGs. The computed tiling size (unroll factor), 64 BREGs / 12 BREGs, is replaced by 4, since the same throughput is achieved with less overhead.

    char *iram0 = vp->new_metrics;          // XPPPreload(0, vp->new_metrics, 4)
    char *iram1 = vp->new_metrics;          // XPPPreloadClean(1, vp->new_metrics, 4)
    char *iram2 = vp->new_metrics+16;       // XPPPreload(2, vp->new_metrics+16, 4)
    char *iram3 = vp->new_metrics+16;       // XPPPreloadClean(3, vp->new_metrics+16, 4)
    char *iram4 = vp->new_metrics+32;       // XPPPreload(4, vp->new_metrics+32, 4)
    char *iram5 = vp->new_metrics+32;       // XPPPreloadClean(5, vp->new_metrics+32, 4)
    char *iram6 = vp->new_metrics+48;       // XPPPreload(6, vp->new_metrics+48, 4)
    char *iram7 = vp->new_metrics+48;       // XPPPreloadClean(7, vp->new_metrics+48, 4)

    for (i=0;i<4;i++) {
        *iram1++ = *iram0++ - minmetric;    // first pipeline
        *iram1++ = *iram0++ - minmetric;
        *iram1++ = *iram0++ - minmetric;
        *iram1++ = *iram0++ - minmetric;
        *iram3++ = *iram2++ - minmetric;    // second pipeline
        *iram3++ = *iram2++ - minmetric;
        *iram3++ = *iram2++ - minmetric;
        *iram3++ = *iram2++ - minmetric;
        *iram5++ = *iram4++ - minmetric;    // third pipeline
        *iram5++ = *iram4++ - minmetric;
        *iram5++ = *iram4++ - minmetric;
        *iram5++ = *iram4++ - minmetric;
        *iram7++ = *iram6++ - minmetric;    // fourth pipeline
        *iram7++ = *iram6++ - minmetric;
        *iram7++ = *iram6++ - minmetric;
        *iram7++ = *iram6++ - minmetric;
    }
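
What one of these pipelines does to a single 32-bit IRAM word can be modelled in a few lines. This is a sketch under stated assumptions: sub4_bytes is a hypothetical helper, the four metrics are assumed to be packed little-end first into a uint32_t, and the subtraction wraps modulo 256 exactly as it would for unsigned char.

```c
#include <stdint.h>

/* Subtract minmetric from each of the four bytes packed in one
 * 32-bit word: split the word into byte lanes, subtract per lane,
 * and join the results into a 32-bit word again. */
uint32_t sub4_bytes(uint32_t word, uint8_t minmetric)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t b = (uint8_t)(word >> (8 * lane));   /* split */
        uint8_t r = (uint8_t)(b - minmetric);        /* sub   */
        result |= (uint32_t)r << (8 * lane);         /* join  */
    }
    return result;
}
```

For example, sub4_bytes(0x0A141E28, 10) yields 0x000A141E: every byte lane is reduced by 10 independently, with no borrow crossing from one lane into the next, which is what the split and join networks guarantee in the data flow graph.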



Parameter                 Value
Vector length
Reused data set size
I/O IRAMs                 5I+4O
ALU                       4*(6(split)+4(sub)+6(join)) = 64
BREG                      4*(6*shln+6*shrn) = 48
FREG
Data flow graph width     16
Data flow graph height
Configuration cycles      2(split)+4*1(sub)+2(join) = 8



5.4.6 Final Code 

Finally we arrive at the following code: 

    struct v29 *vp = p;
    unsigned char *tmp;
    int normalize = 0;

    // Initialization loop eliminated
    // for (i=0;i<8;i++)
    //     vp->dp->w[i] = 0;

    // Configuration for butterfly loop
    char *iram0 = Branchtab29_1;            // XPPPreload(0, Branchtab29_1, 128/4);
    char *iram2 = Branchtab29_2;            // XPPPreload(2, Branchtab29_2, 128/4);
    char *iram4 = vp->old_metrics;          // XPPPreload(4, vp->old_metrics, 128/4);
    char *iram5 = vp->old_metrics+128;      // XPPPreload(5, vp->old_metrics+128, 128/4);
    short *iram6 = vp->new_metrics;         // XPPPreload(6, vp->new_metrics, 128/2);
    unsigned long *iram7 = vp->dp->w;       // XPPPreload(7, vp->dp->w, 8);
    // sym1 & sym2 are in IRAM 1 & 3

    for (_i=0;_i<8;_i++) {
        rlse = 0;
        for (i2=0;i2<32;i2+=2) {            // unrolled once
            unsigned char metric, _tmp, m0, m1, _m0, _m1;

            metric = ((*iram0++ ^ sym1) +
                      (*iram2++ ^ sym2) + 1)/2;
            _tmp = (metric << 1) - 15;
            m0 = *iram4++ + metric;
            m1 = *iram5++ + (15 - metric);
            _m0 = m0 - _tmp;
            _m1 = m1 + _tmp;

            *iram6++ = (min(m0, m1) << 8) | min(_m0, _m1);
            rlse = rlse | ( m0 >=  m1) << i2
                        | (_m0 >= _m1) << (i2+1);
        }
        *iram7++ = rlse;
    }

    /* Renormalize metrics */
    if (vp->new_metrics[0] > 150) {
        int i;

        // Configuration for loop 3
        char *iram0 = vp->new_metrics;          // XPPPreload(0, vp->new_metrics, 8);
        char *iram1 = vp->new_metrics+8;        // XPPPreload(1, vp->new_metrics+8, 8);
        char *iram2 = vp->new_metrics+16;       // XPPPreload(2, vp->new_metrics+16, 8);
        char *iram3 = vp->new_metrics+24;       // XPPPreload(3, vp->new_metrics+24, 8);
        char *iram4 = vp->new_metrics+32;       // XPPPreload(4, vp->new_metrics+32, 8);
        char *iram5 = vp->new_metrics+40;       // XPPPreload(5, vp->new_metrics+40, 8);
        char *iram6 = vp->new_metrics+48;       // XPPPreload(6, vp->new_metrics+48, 8);
        char *iram7 = vp->new_metrics+56;       // XPPPreload(7, vp->new_metrics+56, 8);
        for (_i=0;_i<2;_i++) {
            minmetric0 = min(minmetric0, min(min(*iram0++, *iram0++),
                                             min(*iram0++, *iram0++)));
            minmetric1 = min(minmetric1, min(min(*iram1++, *iram1++),
                                             min(*iram1++, *iram1++)));
            minmetric2 = min(minmetric2, min(min(*iram2++, *iram2++),
                                             min(*iram2++, *iram2++)));
            minmetric3 = min(minmetric3, min(min(*iram3++, *iram3++),
                                             min(*iram3++, *iram3++)));
            minmetric4 = min(minmetric4, min(min(*iram4++, *iram4++),
                                             min(*iram4++, *iram4++)));
            minmetric5 = min(minmetric5, min(min(*iram5++, *iram5++),
                                             min(*iram5++, *iram5++)));
            minmetric6 = min(minmetric6, min(min(*iram6++, *iram6++),
                                             min(*iram6++, *iram6++)));
            minmetric7 = min(minmetric7, min(min(*iram7++, *iram7++),
                                             min(*iram7++, *iram7++)));
        }
        minmetric = min(min(min(minmetric0, minmetric1),
                            min(minmetric2, minmetric3)),
                        min(min(minmetric4, minmetric5),
                            min(minmetric6, minmetric7)));
        // minmetric is written to the output IRAM

        // Configuration for loop 4, minmetric is in an input IRAM
        char *iram0 = vp->new_metrics;          // XPPPreload(0, vp->new_metrics, 4)
        char *iram1 = vp->new_metrics;          // XPPPreloadClean(1, vp->new_metrics, 4)
        char *iram2 = vp->new_metrics+16;       // XPPPreload(2, vp->new_metrics+16, 4)
        char *iram3 = vp->new_metrics+16;       // XPPPreloadClean(3, vp->new_metrics+16, 4)
        char *iram4 = vp->new_metrics+32;       // XPPPreload(4, vp->new_metrics+32, 4)
        char *iram5 = vp->new_metrics+32;       // XPPPreloadClean(5, vp->new_metrics+32, 4)
        char *iram6 = vp->new_metrics+48;       // XPPPreload(6, vp->new_metrics+48, 4)
        char *iram7 = vp->new_metrics+48;       // XPPPreloadClean(7, vp->new_metrics+48, 4)
        for (i=0;i<4;i++) {
            *iram1++ = *iram0++ - minmetric;    // first pipeline
            *iram1++ = *iram0++ - minmetric;
            *iram1++ = *iram0++ - minmetric;
            *iram1++ = *iram0++ - minmetric;
            *iram3++ = *iram2++ - minmetric;    // second pipeline
            *iram3++ = *iram2++ - minmetric;
            *iram3++ = *iram2++ - minmetric;
            *iram3++ = *iram2++ - minmetric;
            *iram5++ = *iram4++ - minmetric;    // third pipeline
            *iram5++ = *iram4++ - minmetric;
            *iram5++ = *iram4++ - minmetric;
            *iram5++ = *iram4++ - minmetric;
            *iram7++ = *iram6++ - minmetric;    // fourth pipeline
            *iram7++ = *iram6++ - minmetric;
            *iram7++ = *iram6++ - minmetric;
            *iram7++ = *iram6++ - minmetric;
        }
        normalize = minmetric;
    }

    vp->dp++;
    tmp = vp->old_metrics;
    vp->old_metrics = vp->new_metrics;
    vp->new_metrics = tmp;

    return normalize;
}

Performance Considerations

In some cases the transformations lead to extremely short vector lengths. This does not hurt, as it still reduces the running time of the configuration and the transfer time from the top of the memory hierarchy to the IRAMs. The vector length could be increased if the outer loop that calls the function were known.

Operation   Cycles
LD/ST       2
LDI         1
MOVE        1
BITOP       1
ADD/SUB     1
MULT        2
CJMP        3

                Butterfly   Min Search   Norm Setup   Normalize
Cycles          70          11           23
Count           128         64           1            64
Issue width     2           2            2            2
Total cycles    4480        352          12           320

Est. RISC cycles: 5168 RISC cycles

We assume an average issue width of two, which means that the RISC on average executes two operations in parallel. The estimate is obtained by counting operations.
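
The way the per-loop totals follow from the cycle counts can be written down as a small helper. This is a sketch of the arithmetic only: est_risc_cycles is a hypothetical name, and rounding up to whole cycles is an assumption chosen because it matches the totals that are legible in the table.

```c
/* Estimated RISC cycles for one loop: cycles per iteration times the
 * iteration count, divided by the assumed average issue width and
 * rounded up to a whole cycle. */
unsigned est_risc_cycles(unsigned cycles_per_iter, unsigned iterations,
                         unsigned issue_width)
{
    unsigned total = cycles_per_iter * iterations;
    return (total + issue_width - 1) / issue_width;
}
```

With an issue width of two, est_risc_cycles(70, 128, 2) gives 4480 for the butterfly loop, est_risc_cycles(11, 64, 2) gives 352 for the minimum search, and est_risc_cycles(23, 1, 2) gives 12 for the setup code.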
5.5 MPEG2 encoder/decoder

5.5.1 Quantization / Inverse Quantization (quant.c)

The quantization routines differ for intra and non-intra blocks; furthermore, the encoder distinguishes between MPEG1 and MPEG2 inverse quantization. All of these functions are candidates for inlining.

Since all functions have the same layout (some checks, one main loop running over the macro block, quantizing with a quantization matrix), we concentrate on the inverse quantization of intra-blocks, since it contains all elements found in the other procedures (the non-intra quantization loop bodies are more complicated, but add no compiler complexity). In the source code the mpeg1 part is already inlined, which is straightforward since the function is statically defined and contains no function calls itself. Therefore the compiler inlines it, and dead function elimination removes the whole definition.

Original Code 

    void iquant_intra(src, dst, dc_prec, quant_mat, mquant)
    short *src, *dst;
    int dc_prec;
    unsigned char *quant_mat;
    int mquant;
    {
        int i, val, sum;

        if (mpeg1) {
            dst[0] = src[0] << (3-dc_prec);
            for (i=1; i<64; i++) {
                val = (int)(src[i]*quant_mat[i]*mquant)/16;

                /* mismatch control */
                if ((val&1)==0 && val!=0)
                    val += (val>0) ? -1 : 1;

                /* saturation */
                dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
            }
        }
        else {
            sum = dst[0] = src[0] << (3-dc_prec);
            for (i=1; i<64; i++) {
                val = (int)(src[i]*quant_mat[i]*mquant)/16;
                sum += dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
            }

            /* mismatch control */
            if ((sum&1)==0)
                dst[63] ^= 1;
        }
    }

Interprocedural Optimizations 

Inlining the code into the loop nest that calls the function, the call site reads:

    for (k=0; k<mb_height*mb_width; k++) {
        if (mbinfo[k].mb_type & MB_INTRA) {
            for (j=0; j<block_count; j++) {
                if (mpeg1) {
                    blocks[k*block_count+j][0] = blocks[k*block_count+j][0] << (3-dc_prec);
                    for (i=1; i<64; i++) {
                        val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
                    }
                } else {
                    sum = blocks[k*block_count+j][0] = blocks[k*block_count+j][0] << (3-dc_prec);
                    for (i=1; i<64; i++) {
                        val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
                    }
                }
            }
        } else {
        }
    }

Basic transformations

Since the global mpeg1 does not change within the loop, loop unswitching moves the control statement outside the j loop and produces two loop nests.

    for (k=0; k<mb_height*mb_width; k++) {
        if (mbinfo[k].mb_type & MB_INTRA) {
            if (mpeg1) {
                for (j=0; j<block_count; j++) {
                    blocks[k*block_count+j][0] = blocks[k*block_count+j][0] << (3-dc_prec);
                    for (i=1; i<64; i++) {
                        val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
                    }
                }
            } else {
                for (j=0; j<block_count; j++) {
                    sum = blocks[k*block_count+j][0] = blocks[k*block_count+j][0] << (3-dc_prec);
                    for (i=1; i<64; i++) {
                        val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
                    }
                }
            }
        }
    }
Furthermore the following transformations are done:

- A peephole optimization reduces the divide by 16 to a right shift by 4. This is essential since we do not consider loop bodies containing division for the XPP.
- Idiom recognition reduces the statement after the "saturation" comment to
  dst[i] = min(max(val, -2048), 2047)
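
The saturation idiom can be verified in isolation. The following sketch uses hypothetical helper names (saturate_ternary, saturate_minmax, saturation_forms_agree); it only assumes the 12-bit range [-2048, 2047] used in the listing.

```c
static int min_i(int a, int b) { return a < b ? a : b; }
static int max_i(int a, int b) { return a > b ? a : b; }

/* Saturation as written in the source: nested conditional expressions. */
int saturate_ternary(int val)
{
    return (val > 2047) ? 2047 : ((val < -2048) ? -2048 : val);
}

/* Saturation after idiom recognition: one max followed by one min. */
int saturate_minmax(int val)
{
    return min_i(max_i(val, -2048), 2047);
}

/* Returns 1 if both forms agree on a range that covers both limits. */
int saturation_forms_agree(void)
{
    for (int val = -5000; val <= 5000; val++)
        if (saturate_ternary(val) != saturate_minmax(val))
            return 0;
    return 1;
}
```

The min/max form needs no control flow, so it maps onto two ALU operations instead of a pair of conditional branches.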

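Increasing parallelism

Now we want to increase parallelism. The j-i loop nest is a candidate for unroll and jam, since the interprocedural value range analysis finds that block_count can only take the values 6, 8 or 12. It therefore has a value range with the additional attribute of being divisible by 2, so unroll and jam with a factor of 2 is applicable.

Note that the source code contains a manually peeled first iteration. This peeling has been done because the value calculated for the first block element is different from the other iterations, and a control statement in the loop would cause a major performance decrease on traditional processors. This does not prevent unroll and jam, but the transformation must be prepared to handle it.

After unroll and jam the source code looks like this (only one of the nests is shown, and the peeled first iterations are moved in front):

    for (j=0; j<block_count; j+=2) {
        blocks[k*count+j][0]   = blocks[k*count+j][0]   << (3-dc_prec);
        blocks[k*count+j+1][0] = blocks[k*count+j+1][0] << (3-dc_prec);
        for (i=1; i<64; i++) {
            val = (int)(blocks[k*count+j][i]*intra_q[i]*mquant) >> 4;
            /* mismatch control */
            if ((val&1)==0 && val!=0)
                val += (val>0) ? -1 : 1;
            /* saturation */
            blocks[k*count+j][i] = min(max(val, -2048), 2047);

            val = (int)(blocks[k*count+j+1][i]*intra_q[i]*mquant) >> 4;
            /* mismatch control */
            if ((val&1)==0 && val!=0)
                val += (val>0) ? -1 : 1;
            /* saturation */
            blocks[k*count+j+1][i] = min(max(val, -2048), 2047);
        }
    }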



Further parallelism can be obtained by index set splitting. Normally used to break dependence cycles, it can here be used to let several sub-configurations work on distinct blocks of data. Thus the i loop is split into 2 or more loops which work on different subsets of the data at the same time.

Sub-configuration is chosen as a working title for configurations which contain independent networks that do not interfere.
Handling the data types

In contrast to the other benchmarks, which all use data type int, this benchmark works on shorter data types. To be compatible with comparable architectures, we must assume that the sizes of char, short and int are 8, 16 and 32 bits respectively. Assuming that the XPP has a bit width of 32, we must take precautions for the shorter data types.

Therefore we split the stream of data packets, with each packet containing 2 or 4 values of the shorter data type, into 2 or 4 streams. If we have enough resources left, this causes no performance penalty. Each of the divided streams is sent to its own calculation network, so in every cycle two short or four char values are handled. Nevertheless this causes an area penalty, because besides the split and merge elements, the whole data flow graph has to be duplicated as often as needed. Each packet is split into its parts by shift operations, and the results are merged behind the calculation branches. The legality of this transformation is the same as with loop unrolling, with an unrolling factor as big as the data type is smaller than the architecture data type.

Unfortunately this is not the end of the story. The compiler further has to assure that every intermediate result which produces an over/underflow for the shorter data type does the same with the bigger data type. Therefore it has to insert clipping operations which assure that the network calculates with real 16 or 8 bit values, respectively.
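
A minimal software model of this split, compute, clip and merge scheme is sketched below for two 16-bit values per 32-bit packet. Everything here is illustrative: wrap16 and process_packet are hypothetical names, the operation y = 3*x + 1 is an arbitrary stand-in for the calculation network, and the clipping is modelled as truncation to 16 bits so that overflow behaves as it would for a real short.

```c
#include <stdint.h>

/* Clip a 32-bit intermediate result to a true 16-bit signed value,
 * so that overflow wraps exactly as it would for a short. */
static int32_t wrap16(int32_t v)
{
    uint32_t low = (uint32_t)v & 0xFFFFu;
    return (low & 0x8000u) ? (int32_t)low - 0x10000 : (int32_t)low;
}

/* Process both 16-bit halves of one 32-bit packet independently. */
uint32_t process_packet(uint32_t packet)
{
    /* split: shift operations separate the two values */
    int32_t lo = wrap16((int32_t)(packet & 0xFFFFu));
    int32_t hi = wrap16((int32_t)(packet >> 16));

    /* two copies of the calculation network, each with clipping */
    lo = wrap16(3 * lo + 1);
    hi = wrap16(3 * hi + 1);

    /* merge: recombine the two results into one packet */
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}
```

For the packet 0x00020005 the halves 5 and 2 become 16 and 7, giving 0x00070010. For 0x00007FFF the low half overflows (3*32767+1 does not fit in 16 bits) and wraps to 0x7FFE, just as a short computation would, without disturbing the high half.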
5.5.2 Inverse Discrete Cosine Transformation (idct.c)

Original Code (idct.c)

    /* two dimensional inverse discrete cosine transform */
    void idct(block)
    short *block;
    {
        int i;

        for (i=0; i<8; i++)
            idctrow(block+8*i);

        for (i=0; i<8; i++)
            idctcol(block+i);
    }
[Figure: 8 x idctrow followed by 8 x idctcol produces the result.]

    /* column (vertical) IDCT
     *
     *             7                         pi         1
     * dst[8*k] = sum c[l] * src[8*l] * cos( -- * ( k + - ) * l )
     *            l=0                        8          2
     *
     * where: c[0]    = 1/1024
     *        c[1..7] = (1/1024)*sqrt(2)
     */
    static void idctcol(blk)
    short *blk;
    {
        int x0, x1, x2, x3, x4, x5, x6, x7, x8;

        /* shortcut */
        if (!((x1 = (blk[8*4]<<8)) | (x2 = blk[8*6]) | (x3 = blk[8*2]) |
              (x4 = blk[8*1]) | (x5 = blk[8*7]) | (x6 = blk[8*5]) | (x7 = blk[8*3])))
        {
            blk[8*0]=blk[8*1]=blk[8*2]=blk[8*3]=blk[8*4]=blk[8*5]=blk[8*6]=blk[8*7]=
                iclp[(blk[8*0]+32)>>6];
            return;
        }

        x0 = (blk[8*0]<<8) + 8192;

        /* first stage */
        x8 = W7*(x4+x5) + 4;
        x4 = (x8+(W1-W7)*x4)>>3;
        x5 = (x8-(W1+W7)*x5)>>3;
        x8 = W3*(x6+x7) + 4;
        x6 = (x8-(W3-W5)*x6)>>3;
        x7 = (x8-(W3+W5)*x7)>>3;

        /* second stage */
        x8 = x0 + x1;
        x0 -= x1;
        x1 = W6*(x3+x2) + 4;
        x2 = (x1-(W2+W6)*x2)>>3;
        x3 = (x1+(W2-W6)*x3)>>3;
        x1 = x4 + x6;
        x4 -= x6;
        x6 = x5 + x7;
        x5 -= x7;

        /* third stage */
        x7 = x8 + x3;
        x8 -= x3;
        x3 = x0 + x2;
        x0 -= x2;
        x2 = (181*(x4+x5)+128)>>8;
        x4 = (181*(x4-x5)+128)>>8;

        /* fourth stage */
        blk[8*0] = iclp[(x7+x1)>>14];
        blk[8*1] = iclp[(x3+x2)>>14];
        blk[8*2] = iclp[(x0+x4)>>14];
        blk[8*3] = iclp[(x8+x6)>>14];
        blk[8*4] = iclp[(x8-x6)>>14];
        blk[8*5] = iclp[(x0-x4)>>14];
        blk[8*6] = iclp[(x3-x2)>>14];
        blk[8*7] = iclp[(x7-x1)>>14];
    }
W1 to W7 are macros for numeric constants that are substituted by the preprocessor. The iclp array is used for clipping the results. It is fully defined by the init_idct function before idct is called the first time:

    void init_idct()
    {
        int i;

        iclp = iclip+512;
        for (i= -512; i<512; i++)
            iclp[i] = (i<-256) ? -256 : ((i>255) ? 255 : i);
    }



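A special kind of idiom recognition is able to replace the calculation of each array element by a compiler known function that can be realized efficiently on the XPP. If the compiler features whole program memory aliasing analysis, it is able to replace all uses of the iclp array with a call of the compiler known function. Alternatively, a developer can replace the iclp array accesses manually by the compiler known saturation function calls. The illustration shows a possible implementation of saturate(val,n) as an NML schematic using two ALUs. In this case it is necessary to replace array accesses like iclp[i] by saturate(i,256).

[Figure: NML schematic of saturate(val,n), built from two ALUs (SORT elements).]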
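
The replacement of the table lookup by a saturation function can be illustrated in plain C. In this sketch the semantics of saturate(val, n) are an assumption inferred from the iclp definition, namely clamping to the range [-n, n-1]; saturate_matches_table is a hypothetical helper, and init_idct is repeated with an ANSI prototype so that the example is self-contained.

```c
static short iclip[1024];
static short *iclp;

/* Table initialisation as in the original source. */
void init_idct(void)
{
    int i;
    iclp = iclip + 512;
    for (i = -512; i < 512; i++)
        iclp[i] = (short)((i < -256) ? -256 : ((i > 255) ? 255 : i));
}

/* Assumed meaning of the compiler known function:
 * clamp val to the range [-n, n-1]. */
int saturate(int val, int n)
{
    if (val < -n)
        return -n;
    if (val > n - 1)
        return n - 1;
    return val;
}

/* Returns 1 if saturate(i, 256) equals iclp[i] for every valid index. */
int saturate_matches_table(void)
{
    init_idct();
    for (int i = -512; i < 512; i++)
        if (saturate(i, 256) != iclp[i])
            return 0;
    return 1;
}
```

Under that assumption the lookup iclp[i] and the call saturate(i, 256) are interchangeable over the whole index range of the table, which removes the need to hold the table in an IRAM.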

The /* shortcut */ code in idctcol speeds up column processing if x1 to x7 are equal to zero. This breaks the well-formed structure of the loop nest. The if-condition is not loop invariant, so loop unswitching cannot be applied. Nonetheless, the code after the shortcut handling is well suited for the XPP. It is possible to synthesize if-conditions for the XPP (speculative processing of both blocks plus selection based on the condition), but this would just waste PAEs without any performance benefit. Therefore the /* shortcut */ code in idctrow and idctcol has to be removed manually. The code snippet below shows the inlined version of the idctrow loop with additional cache instructions for XPP control:

    void idct(block)
    short *block;
    {
        int i;

        XPPPreload(IDCTROW_CONFIG);     // Loop invariant
        for (i=0; i<8; i++) {
            short *blk;
            int x0, x1, x2, x3, x4, x5, x6, x7, x8;

            blk = block+8*i;
            XPPPreload(0, blk, 8);
            XPPPreloadClean(1, blk, 8); // IRAM1 is erased and assigned to blk
            XPPExecute(IDCTROW_CONFIG, IRAM(0), IRAM(1));
        }
        for (i=0; i<8; i++) {
        }
    }

As the configuration of the XPP does not change during the loop execution, invariant code motion has moved XPPPreload(IDCTROW_CONFIG) out of the loop.
NML Code Generation

Data Flow Graph

For this example the heuristic produces the following results:

               ALUs   FREGs   BREGs
Res. left
Res. avail.    64     80      80
Address Generation:

To fully synthesize the loop body we have to face the problem of address generation for accessing the data.

For IDCTCOLUMN_CONFIG we have to select the nth element of every row, which means an address series of (0, 8, 16, ..., 1, 9, 17, ...). We use two counter macros for address generation: one counter increments by eight and the other by one. The IRAM output is passed to the data flow graph of IDCTCOLUMN_CONFIG. When all eight elements of a column are available, SWAP is switched through to the data flow graph input and the calculation for a new column begins.

For IDCTROW_CONFIG the address generation is very simple, as the IRAM already holds the block in the appropriate order (row after row, as it has to be accessed). By using SIUP (stepped iterative up) counter macros as described in the XPP tutorial, it is possible to map linear address expressions to NML code in a generic way. As IDCTROW_CONFIG accesses a two-dimensional array, we need two SIUP counters in the corresponding NML code. The elements have to be accessed row after row, so one counter's increment is one and the other's is eight. However, the NML code for this access pattern (0, ..., 5, 6, 7, 8, 9, ..., 63) can be reduced to one single counter (or to FIFO-mode access).
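
The two access patterns can be reproduced with two nested counters in ordinary C. This is only a software model of the address series: column_addresses and row_addresses are hypothetical names, and the mapping of "upper" and "lower" counter to the two loops is an assumption.

```c
/* Column-wise access for an 8x8 block stored row after row:
 * one counter steps through a column in increments of eight,
 * the other selects the column in increments of one. */
void column_addresses(int addr[64])
{
    int k = 0;
    for (int col = 0; col < 8; col++)           /* increment 1 */
        for (int row = 0; row < 64; row += 8)   /* increment 8 */
            addr[k++] = row + col;
}

/* Row-wise access: the two counters collapse into a single
 * linear counter running from 0 to 63. */
void row_addresses(int addr[64])
{
    for (int k = 0; k < 64; k++)
        addr[k] = k;
}
```

column_addresses produces 0, 8, 16, ..., 56, 1, 9, ..., 63, which is the series needed for column processing, while row_addresses shows why the row configuration can use a single counter or FIFO-mode access.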

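Further Enhancing XPP Utilization

Problem (Pipeline Depth)

Looking at the data flow graphs, the pipeline is too deep for the small amount of data that is processed with it. First the units at the end of the pipeline are idle, and then the units at the beginning of the pipeline are unused.

[Figure: pipeline depth, with idle units at the end of the pipeline while it fills and idle units at the beginning while it drains.]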
Solution (Loop Tiling)

It is profitable to move the dependencies between row and column processing to an outer level of the loop nest. The loop that calls the idct function (in transform.c) on several blocks of the image has no dependencies that prevent this, so this loop can be moved inside the loops of row and column processing.

    // transform.c
    for (n=0; n<block_count; n++) {
        idct(blocks[k*block_count+n]);     // block_count is 6 or 8 or 12
    }

    // idct.c
    /* two dimensional inverse discrete cosine transform */
    void idct(block)
    short *block;
    {
        int i;

        for (i=0; i<8; i++)
            idctrow(block+8*i);
        for (i=0; i<8; i++)
            idctcol(block+i);
    }

Now the processing of rows and columns can be applied to more data (by applying loop tiling).

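Constraints (Cache Sensitive Loop Tiling)

We have to adapt the number of blocks that will be processed to the cache size, because IDCTROW_CONFIG and the subsequent IDCTCOLUMN_CONFIG configuration work on the same blocks. Loop tiling therefore has to be done with respect to the cache size, so that the processed data fits into the cache.

IRAM reuse between different configurations

If one configuration produces data that the next configuration consumes (and no other memory is accessed), we can bypass the memory interface by using the output IRAM of Config A as input IRAM of Config B.

[Figure: Config A writes its output IRAM, which Config B reads as its input IRAM (shared IRAM).]

Putting all together

[Figure: blocks of 8x8 values pass through Config A into the output IRAM and from there through Config B.]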
If we apply cache sensitive loop tiling, IRAM reuse and function inlining, we can further optimize our example:

    // transform.c
    block = blocks[k*6];
    XPPPreload(IDCTROW_CONFIG);
    XPPPreload(0, block, 64*6);         // IRAM0 gets 64 words from 6 blocks
    XPPPreloadClean(1, block, 64*6);    // erase IRAM1 and assign to the 6 blocks
    XPPExecute(IDCTROW_CONFIG, IRAM(0), IRAM(1));

    XPPPreload(IDCTCOLUMN_CONFIG);
    XPPPreload(1, block, 64*6);         // redundant -> will be eliminated
    XPPExecute(IDCTCOLUMN_CONFIG, IRAM(1), IRAM(2));

The address generation of IDCTROW_CONFIG and IDCTCOLUMN_CONFIG has to be modified to reflect the different data block size. It is extended by an additional SIUP counter for the block offset (block_count = 6).

The table contains architectural parameters for IDCTROW_CONFIG and IDCTCOLUMN_CONFIG of the final result. It relies on a cache that is able to store block_count blocks. As two configurations are executed in this example, the configuration cycles have to be taken twice.
Performance Considerations

In this example it is possible to exploit high data locality, which means that many operations are performed on a limited memory range. The performance of the XPP solution is compared to a hypothetical superscalar RISC architecture. We assume an issue width of two, which means that the RISC executes on average two operations in parallel.

Op          Ops for Row/Column   Cycles   Est. RISC cycles
LD/ST       16                   2        32
ADD/SUB     35
MULT        11                   2
SHIFT       18
SAT         8                    4

Issue width        2
Cyc/Row(Col)
Proc. Rows         8
RISC Cyc/Blk
XPP Cyc/Blk        (with data duplication + reordering: 24)
Speedup            (with data duplication + reordering: 52)

Fetching the input data from a single IRAM reduces the potential speedup significantly. In other words, we have a pipeline that is able to process eight input values per cycle, but we are loading the pipeline with only one value per cycle. As a result, only every eighth pipeline stage is filled.

[Figure: pipeline utilization without data duplication and with data duplication.]

A way to improve the memory throughput to the pipeline is data duplication as described in the hardware section.
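This changes

    XPPPreloadClean(1, block, 64*6);    // erase IRAM1 and assign to the 6 blocks

to:

    tmpsize = 64*6/8;
    XPPPreloadClean(8,  block+0*tmpsize, tmpsize);  // IRAM8 for interm. Rslt 1
    XPPPreloadClean(9,  block+1*tmpsize, tmpsize);  // IRAM9 for interm. Rslt 1
    XPPPreloadClean(10, block+2*tmpsize, tmpsize);  // IRAM10 for interm. Rslt 1
    XPPPreloadClean(11, block+3*tmpsize, tmpsize);  // IRAM11 for interm. Rslt 1
    XPPPreloadClean(12, block+4*tmpsize, tmpsize);  // IRAM12 for interm. Rslt 1
    XPPPreloadClean(13, block+5*tmpsize, tmpsize);  // IRAM13 for interm. Rslt 1
    XPPPreloadClean(14, block+6*tmpsize, tmpsize);  // IRAM14 for interm. Rslt 1
    XPPPreloadClean(15, block+7*tmpsize, tmpsize);  // IRAM15 for interm. Rslt 1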



[Figure: IDCTROW_CONFIG reads IRAM0 up to IRAM7 and writes IRAM8 up to IRAM15; IDCTCOLUMN_CONFIG reads IRAM0 up to IRAM7 and writes IRAM8 up to IRAM15; REORDER_CONFIG reads IRAM0 up to IRAM7 and writes the reordered rows of the blocks to IRAM8 up to IRAM13.]



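Data duplication and data reordering finally transform the example to:

    // transform.c
    block = blocks[k*6];
    XPPPreload(IDCTROW_CONFIG);
    XPPPreloadMultiple(0xFF, block, 64*6);          // load IRAM0 up to IRAM7 with blocks
    tmpsize = 64*6/8;                               // result gets separated into 8 IRAMs
    XPPPreloadClean(8,  block+0*tmpsize, tmpsize);  // IRAM8 for interm. Rslt 1
    XPPPreloadClean(9,  block+1*tmpsize, tmpsize);  // IRAM9 for interm. Rslt 1
    XPPPreloadClean(10, block+2*tmpsize, tmpsize);  // IRAM10 for interm. Rslt 1
    XPPPreloadClean(11, block+3*tmpsize, tmpsize);  // IRAM11 for interm. Rslt 1
    XPPPreloadClean(12, block+4*tmpsize, tmpsize);  // IRAM12 for interm. Rslt 1
    XPPPreloadClean(13, block+5*tmpsize, tmpsize);  // IRAM13 for interm. Rslt 1
    XPPPreloadClean(14, block+6*tmpsize, tmpsize);  // IRAM14 for interm. Rslt 1
    XPPPreloadClean(15, block+7*tmpsize, tmpsize);  // IRAM15 for interm. Rslt 1
    XPPExecute(IDCTROW_CONFIG, ...);

    XPPPreload(IDCTCOLUMN_CONFIG);
    XPPPreloadMultiple(0xFF, block, 64*6);          // ld IRAM0-IRAM7 with interm. Rslt 1
    XPPPreloadClean(8,  block+0*tmpsize, tmpsize);  // IRAM8 for interm. Rslt 2
    XPPPreloadClean(9,  block+1*tmpsize, tmpsize);  // IRAM9 for interm. Rslt 2
    XPPPreloadClean(10, block+2*tmpsize, tmpsize);  // IRAM10 for interm. Rslt 2
    XPPPreloadClean(11, block+3*tmpsize, tmpsize);  // IRAM11 for interm. Rslt 2
    XPPPreloadClean(12, block+4*tmpsize, tmpsize);  // IRAM12 for interm. Rslt 2
    XPPPreloadClean(13, block+5*tmpsize, tmpsize);  // IRAM13 for interm. Rslt 2
    XPPPreloadClean(14, block+6*tmpsize, tmpsize);  // IRAM14 for interm. Rslt 2
    XPPPreloadClean(15, block+7*tmpsize, tmpsize);  // IRAM15 for interm. Rslt 2
    XPPExecute(IDCTCOLUMN_CONFIG, ...);

    XPPPreload(REORDER_CONFIG);
    XPPPreloadMultiple(0xFF, block, 64*6);          // ld IRAM0-IRAM7 with interm. Rslt 2
    rsltsize = 64;                                  // 64*6/6;
    XPPPreloadClean(8,  block+0*rsltsize, rsltsize);    // IRAM8 for final Rslt
    XPPPreloadClean(9,  block+1*rsltsize, rsltsize);    // IRAM9 for final Rslt
    XPPPreloadClean(10, block+2*rsltsize, rsltsize);    // IRAM10 for final Rslt
    XPPPreloadClean(11, block+3*rsltsize, rsltsize);    // IRAM11 for final Rslt
    XPPPreloadClean(12, block+4*rsltsize, rsltsize);    // IRAM12 for final Rslt
    XPPPreloadClean(13, block+5*rsltsize, rsltsize);    // IRAM13 for final Rslt
    XPPExecute(REORDER_CONFIG, ...);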
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5.6 Wavelet 

5.6.1 Original Code 

void forward_wavelet()
{
   int i, nt, *dmid;
   int *sp, *dp, d_tmp0, d_tmp1, d_tmpi, s_tmp0, s_tmp1;
   int mid, ii;
   int *x;
   int s[256], d[256];

   for (nt = COL; nt >= BLOCK_SIZE; nt >>= 1) {
      for (i = 0; i < nt*COL /*tmp_nt*/; i += COL) {
         x = &int_data[i];
         mid = (nt >> 1) - 1;

         s[0] = x[0];
         d[0] = x[ROW];
         s[1] = x[2];
         s[mid] = x[2*mid];
         d[mid] = x[2*mid+ROW];

         d[0] = (d[0] << 1) - s[0] - s[1];
         s[0] = s[0] + (d[0] >> 2);

         d_tmp0 = d[0];
         s_tmp0 = s[1];

         for (ii = 1; ii < mid; ii++) {
            s_tmp1 = x[2*ii+2];
            /* ... */
            s_tmp0 = s_tmp1;
         }

         d[mid] = (d[mid] - s[mid]) << 1;
         s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

         for (ii = 0; ii <= mid; ii++) {
            x[ii] = s[ii];
            x[ii+mid+1] = d[ii];
         }
      }

      for (i = 0; i < nt; i++) {
         x = &int_data[i];
         mid = (nt >> 1) - 1;

         s[0] = x[0];
         d[0] = x[COL];
         s[1] = x[COL<<1];
         s[mid] = x[(COL<<1)*mid];
         /* ... */

         d[0] = (d[0] << 1) - s[0] - s[1];
         s[0] = s[0] + (d[0] >> 2);

         d_tmp0 = d[0];
         s_tmp0 = s[1];

         for (ii = 1; ii < mid; ii++) {
            s_tmp1 = x[2*COL*(ii+1)];
            /* ... */
            s_tmp0 = s_tmp1;
         }

         d[mid] = (d[mid] << 1) - (s[mid] << 1);
         s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

         for (ii = 0; ii <= mid; ii++) {
            /* ... */
            x[(ii+mid+1)*COL] = d[ii];
         }
      }
   }
}

5.6.2 Optimizing the Whole Loop Nest

void forward_wavelet()
{
   int i, nt, *dmid;
   int *sp, *dp, d_tmp0, d_tmp1, d_tmpi, s_tmp0, s_tmp1;
   int mid, ii;
   int *x;
   int s[256], d[256];

   for (nt = 64; nt >= 16; nt >>= 1) {
      for (i = 0; i < nt*64; i += 64) {
         x = &int_data[i];
         mid = (nt >> 1) - 1;

         s[0] = x[0];
         d[0] = x[1];
         s[1] = x[2];
         s[mid] = x[2*mid];
         d[mid] = x[2*mid+1];

         d[0] = (d[0] << 1) - s[0] - s[1];
         s[0] = s[0] + (d[0] >> 2);

         d_tmp0 = d[0];
         s_tmp0 = s[1];

         for (ii = 1; ii < mid; ii++) {
            d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
            s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
            d_tmp0 = d[ii];
            s_tmp0 = s[ii];
         }

         d[mid] = (d[mid] - s[mid]) << 1;
         s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

         for (ii = 0; ii <= mid; ii++) {
            x[ii] = s[ii];
            x[ii+mid+1] = d[ii];
         }
      }

      for (i = 0; i < nt; i++) {
         x = &int_data[i];
         mid = (nt >> 1) - 1;

         s[0] = x[0];
         d[0] = x[64];
         s[1] = x[128];
         s[mid] = x[128*mid];
         d[mid] = x[128*mid+64];

         d[0] = (d[0] << 1) - s[0] - s[1];
         s[0] = s[0] + (d[0] >> 2);

         d_tmp0 = d[0];
         s_tmp0 = s[1];

         for (ii = 1; ii < mid; ii++) {
            d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
            s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
            d_tmp0 = d[ii];
            s_tmp0 = s[ii];
         }

         d[mid] = (d[mid] << 1) - (s[mid] << 1);
         s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

         for (ii = 0; ii <= mid; ii++) {
            x[ii*64] = s[ii];
            x[(ii+mid+1)*64] = d[ii];
         }
      }
   }
}



Then we have 4 tables, one for each innermost loop [...]

Parameter                    Value
Vector length                mid
Reused data set size
I/O IRAMs
ALU
BREG
FREG
Data flow graph width
Data flow graph height
Configuration cycles         mid

Parameter                    Value
Vector length                mid-2
Reused data set size
I/O IRAMs                    6
ALU                          6
BREG                         0
FREG                         2
Data flow graph width        2
Data flow graph height
Configuration cycles         6+(mid-2)



^t t " ri 6 t:>?.rr loop • ^^^ssa^tr- 

for (i = 0; i < nt*64; i += 64) {
   x = &int_data[i];
   mid = (nt >> 1) - 1;

   s[0] = x[0];
   d[0] = x[1];
   s[1] = x[2];
   s[mid] = x[2*mid];
   d[mid] = x[2*mid+1];

   d[0] = (d[0] << 1) - s[0] - s[1];
   s[0] = s[0] + (d[0] >> 2);

   d_tmp0 = d[0];
   s_tmp0 = s[1];

   for (ii = 1; ii < mid; ii++) {
      d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
      s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
   }

   for (ii = 1; ii < mid; ii++) {
      x[ii] = s[ii];
      x[ii+mid+1] = d[ii];
   }

   d[mid] = (d[mid] - s[mid]) << 1;
   s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

   x[0] = s[0];
   x[mid+1] = d[0];
   x[mid] = s[mid];
   x[2*mid+1] = d[mid];
}

for (i = 0; i < nt; i++) {
   x = &int_data[i];
   mid = (nt >> 1) - 1;

   s[0] = x[0];
   d[0] = x[64];
   s[1] = x[128];
   s[mid] = x[128*mid];
   d[mid] = x[128*mid+64];

   d[0] = (d[0] << 1) - s[0] - s[1];
   s[0] = s[0] + (d[0] >> 2);

   d_tmp0 = d[0];
   s_tmp0 = s[1];

   for (ii = 1; ii < mid; ii++) {
      d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
      s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
   }

   for (ii = 1; ii < mid; ii++) {
      x[ii*64] = s[ii];
      x[(ii+mid+1)*64] = d[ii];
   }

   d[mid] = (d[mid] << 1) - (s[mid] << 1);
   s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

   x[0] = s[0];
   x[(mid+1)*64] = d[0];
   x[mid*64] = s[mid];
   x[(2*mid+1)*64] = d[mid];
}




for (nt = 64; nt >= 16; nt >>= 1) {
   for (i = 0; i < nt*64 /*tmp_nt*/; i += 64) {
      x = &int_data[i];
      mid = (nt >> 1) - 1;

      s[0] = x[0];
      d[0] = x[1];
      s[1] = x[2];
      s[mid] = x[2*mid];
      d[mid] = x[2*mid+1];

      d[0] = (d[0] << 1) - s[0] - s[1];
      s[0] = s[0] + (d[0] >> 2);

      d_tmp0 = d[0];
      s_tmp0 = s[1];

      for (ii = 1; ii < mid; ii++) {
         d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
         s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
         d_tmp0 = d[ii];
         s_tmp0 = s[ii];
         x[ii] = s[ii];
         x[ii+mid+1] = d[ii];
      }

      d[mid] = (d[mid] - s[mid]) << 1;
      s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

      x[0] = s[0];
      x[mid+1] = d[0];
      x[mid] = s[mid];
      x[2*mid+1] = d[mid];
   }

   for (i = 0; i < nt; i++) {
      x = &int_data[i];
      mid = (nt >> 1) - 1;

      s[0] = x[0];
      d[0] = x[64];
      s[1] = x[128];
      s[mid] = x[128*mid];
      d[mid] = x[128*mid+64];

      d[0] = (d[0] << 1) - s[0] - s[1];
      s[0] = s[0] + (d[0] >> 2);

      d_tmp0 = d[0];
      s_tmp0 = s[1];

      for (ii = 1; ii < mid; ii++) {
         d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
         s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
         d_tmp0 = d[ii];
         s_tmp0 = s[ii];
         x[ii*64] = s[ii];
         x[(ii+mid+1)*64] = d[ii];
      }

      d[mid] = (d[mid] << 1) - (s[mid] << 1);
      s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

      x[0] = s[0];
      x[(mid+1)*64] = d[0];
      x[mid*64] = s[mid];
      x[(2*mid+1)*64] = d[mid];
   }
}



After loop fusion, we only have two loops, which have the same parameter table.

Parameter                    Value
Vector length                mid-2
Reused data set size
I/O IRAMs                    8
ALU                          6
BREG                         0
FREG                         2
Data flow graph width        2
Data flow graph height       6
Configuration cycles         6+(mid-2)



[...] loops with constant loop bounds can be handled more efficiently on the PACT XPP. This means
that the inner loops can be better optimized if mid is replaced by a constant value. This will happen
when the outer loop is unrolled. This way we will obtain a bigger code, but with 3 instances of the
loop nests, each being a candidate for a configuration. This can be seen as a kind of temporal
partitioning. Thus the outer loop is completely unrolled, giving six new loop nests.
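The constants that appear in the six unrolled loop nests follow mechanically from nt. The sketch below recomputes them for nt = 64, 32 and 16; the struct and function names are hypothetical, and COL = 64 is taken from the constant-propagated listings in this section.

```c
#include <assert.h>

#define COL 64   /* row length of int_data, as in the constant-propagated code */

/* Loop constants for one value of nt (64, 32 or 16). */
struct nest_consts {
    int mid;         /* (nt >> 1) - 1 */
    int row_limit;   /* upper bound of i in the row loop: nt * COL */
    int row_s_mid;   /* index read into s[mid] in the row loop: 2*mid */
    int col_s_mid;   /* index read into s[mid] in the column loop: 2*COL*mid */
    int col_d_mid;   /* index read into d[mid] in the column loop */
};

static struct nest_consts consts_for(int nt)
{
    struct nest_consts c;
    assert(nt == 64 || nt == 32 || nt == 16);
    c.mid = (nt >> 1) - 1;
    c.row_limit = nt * COL;
    c.row_s_mid = 2 * c.mid;
    c.col_s_mid = 2 * COL * c.mid;
    c.col_d_mid = 2 * COL * c.mid + COL;
    return c;
}
```

For nt = 64 this gives mid = 31 and the column indices 3968 and 4032, matching the constants that appear in the unrolled code.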



for (i = 0; i < 4096; i += 64) {   /* nt = 64 */
   x = &int_data[i];
   mid = 31;

   s[0] = x[0];
   d[0] = x[1];
   s[1] = x[2];
   s[31] = x[62];
   d[31] = x[63];

   d[0] = (d[0] << 1) - s[0] - s[1];
   s[0] = s[0] + (d[0] >> 2);

   d_tmp0 = d[0];
   s_tmp0 = s[1];

   for (ii = 1; ii < 31; ii++) {
      d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
      s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
      x[ii] = s[ii];
      x[ii+32] = d[ii];
   }

   d[31] = (d[31] - s[31]) << 1;
   s[31] = s[31] + ((d[30] + d[31]) >> 3);

   x[0] = s[0];
   x[32] = d[0];
   x[31] = s[31];
   x[63] = d[31];
}

for (i = 0; i < 64; i++) {
   x = &int_data[i];
   mid = 31;

   s[0] = x[0];
   d[0] = x[64];
   s[1] = x[128];
   s[31] = x[3968];
   d[31] = x[4032];

   d[0] = (d[0] << 1) - s[0] - s[1];
   s[0] = s[0] + (d[0] >> 2);

   d_tmp0 = d[0];
   s_tmp0 = s[1];

   for (ii = 1; ii < 31; ii++) {
      d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
      s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
      x[ii*64] = s[ii];
      x[(ii+32)*64] = d[ii];
   }

   d[31] = (d[31] << 1) - (s[31] << 1);
   s[31] = s[31] + ((d[30] + d[31]) >> 3);

   x[0] = s[0];
   x[2048] = d[0];
   x[1984] = s[31];
   x[4032] = d[31];
}



for (i = 0; i < 2048; i += 64) {   /* nt = 32 */
   x = &int_data[i];
   mid = 15;

   s[0] = x[0];
   d[0] = x[1];
   s[1] = x[2];
   s[15] = x[30];
   d[15] = x[31];

   d[0] = (d[0] << 1) - s[0] - s[1];
   s[0] = s[0] + (d[0] >> 2);

   d_tmp0 = d[0];
   s_tmp0 = s[1];

   for (ii = 1; ii < 15; ii++) {
      d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
      s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
      x[ii] = s[ii];
      x[ii+16] = d[ii];
   }

   d[15] = (d[15] - s[15]) << 1;
   s[15] = s[15] + ((d[14] + d[15]) >> 3);

   x[0] = s[0];
   x[16] = d[0];
   x[15] = s[15];
   x[31] = d[15];
}

for (i = 0; i < 32; i++) {
   x = &int_data[i];
   mid = 15;

   s[0] = x[0];
   d[0] = x[64];
   s[1] = x[128];
   s[15] = x[1920];
   d[15] = x[1984];

   d[0] = (d[0] << 1) - s[0] - s[1];
   s[0] = s[0] + (d[0] >> 2);

   d_tmp0 = d[0];
   s_tmp0 = s[1];

   for (ii = 1; ii < 15; ii++) {
      d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
      s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
      x[ii*64] = s[ii];
      x[(ii+16)*64] = d[ii];
   }

   d[15] = (d[15] << 1) - (s[15] << 1);
   s[15] = s[15] + ((d[14] + d[15]) >> 3);

   x[0] = s[0];
   x[1024] = d[0];
   x[960] = s[15];
   x[1984] = d[15];
}



for (i = 0; i < 1024; i += 64) {   /* nt = 16 */
   x = &int_data[i];
   mid = 7;

   s[0] = x[0];
   d[0] = x[1];
   s[1] = x[2];
   s[7] = x[14];
   d[7] = x[15];

   d[0] = (d[0] << 1) - s[0] - s[1];
   s[0] = s[0] + (d[0] >> 2);

   d_tmp0 = d[0];
   s_tmp0 = s[1];

   for (ii = 1; ii < 7; ii++) {
      d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
      s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
      x[ii] = s[ii];
      x[ii+8] = d[ii];
   }

   d[7] = (d[7] - s[7]) << 1;
   s[7] = s[7] + ((d[6] + d[7]) >> 3);

   x[0] = s[0];
   x[8] = d[0];
   x[7] = s[7];
   x[15] = d[7];
}

for (i = 0; i < 16; i++) {
   x = &int_data[i];
   mid = 7;

   s[0] = x[0];
   d[0] = x[64];
   s[1] = x[128];
   s[7] = x[896];
   d[7] = x[960];

   d[0] = (d[0] << 1) - s[0] - s[1];
   s[0] = s[0] + (d[0] >> 2);

   d_tmp0 = d[0];
   s_tmp0 = s[1];

   for (ii = 1; ii < 7; ii++) {
      d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
      s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
      x[ii*64] = s[ii];
      x[(ii+8)*64] = d[ii];
   }

   d[7] = (d[7] << 1) - (s[7] << 1);
   s[7] = s[7] + ((d[6] + d[7]) >> 3);

   x[0] = s[0];
   x[512] = d[0];
   x[448] = s[7];
   x[960] = d[7];
}

5.6.3 Optimizing the Inner Loops



[...] loop bodies. Below we present only the unrolled inner loops, for convenience.

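First loop:

for (ii = 1; ii < 31; ii = ii+2) {
   d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
   s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
   d_tmp0 = d[ii];
   s_tmp0 = s[ii];
   x[ii] = s[ii];
   x[ii+32] = d[ii];

   d[ii+1] = ((x[2*(ii+1)+1]) << 1) - s_tmp0 - x[2*(ii+1)+2];
   s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
   d_tmp0 = d[ii+1];
   s_tmp0 = s[ii+1];
   x[ii+1] = s[ii+1];
   x[ii+33] = d[ii+1];
}

Second loop:

for (ii = 1; ii < 31; ii = ii+2) {
   d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
   s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
   d_tmp0 = d[ii];
   s_tmp0 = s[ii];
   x[ii*64] = s[ii];
   x[(ii+32)*64] = d[ii];

   d[ii+1] = (x[128*(ii+1)+64] << 1) - s_tmp0 - x[128*(ii+2)];
   s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
   d_tmp0 = d[ii+1];
   s_tmp0 = s[ii+1];
   x[(ii+1)*64] = s[ii+1];
   x[(ii+33)*64] = d[ii+1];
}

Third loop: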

for (ii = 1; ii < 15; ii = ii+2) {
   d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
   s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
   d_tmp0 = d[ii];
   s_tmp0 = s[ii];
   x[ii] = s[ii];
   x[ii+16] = d[ii];

   d[ii+1] = ((x[2*(ii+1)+1]) << 1) - s_tmp0 - x[2*(ii+1)+2];
   s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
   d_tmp0 = d[ii+1];
   s_tmp0 = s[ii+1];
   x[ii+1] = s[ii+1];
   x[ii+17] = d[ii+1];
}

Fourth loop:

for (ii = 1; ii < 15; ii = ii+2) {
   d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
   s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
   d_tmp0 = d[ii];
   s_tmp0 = s[ii];
   x[ii*64] = s[ii];
   x[(ii+16)*64] = d[ii];

   d[ii+1] = (x[128*(ii+1)+64] << 1) - s_tmp0 - x[128*(ii+2)];
   s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
   d_tmp0 = d[ii+1];
   s_tmp0 = s[ii+1];
   x[(ii+1)*64] = s[ii+1];
   x[(ii+17)*64] = d[ii+1];
}

Fifth loop:

for (ii = 1; ii < 7; ii = ii+2) {
   d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
   s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
   d_tmp0 = d[ii];
   s_tmp0 = s[ii];
   x[ii] = s[ii];
   x[ii+8] = d[ii];

   d[ii+1] = ((x[2*(ii+1)+1]) << 1) - s_tmp0 - x[2*(ii+1)+2];
   s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
   d_tmp0 = d[ii+1];
   s_tmp0 = s[ii+1];
   x[ii+1] = s[ii+1];
   x[ii+9] = d[ii+1];
}

Sixth loop:

for (ii = 1; ii < 7; ii = ii+2) {
   d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
   s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
   d_tmp0 = d[ii];
   s_tmp0 = s[ii];
   x[ii*64] = s[ii];
   x[(ii+8)*64] = d[ii];

   d[ii+1] = (x[128*(ii+1)+64] << 1) - s_tmp0 - x[128*(ii+2)];
   s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
   d_tmp0 = d[ii+1];
   s_tmp0 = s[ii+1];
   x[(ii+1)*64] = s[ii+1];
   x[(ii+9)*64] = d[ii+1];
}
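Unrolling by two is only result-preserving without a remainder loop when the trip count is even, which holds for all six loops (30, 14 and 6 iterations). The following sketch checks this for the row loop with mid = 31. It is an illustration under assumptions, not the compiler's output: the input is kept in a separate read-only array in[], the helper shr3 is a portable stand-in for the source's >> 3, and all function names are hypothetical.

```c
#include <assert.h>

#define MID 31                 /* nt = 64 */

/* Portable floor(v / 8), standing in for the source's v >> 3. */
static int shr3(int v) { return v >= 0 ? v / 8 : -((-v + 7) / 8); }

/* One iteration of the row loop body; in[] is read-only, s[] and d[] are outputs. */
static void body(const int *in, int *s, int *d, int ii, int *d_tmp0, int *s_tmp0)
{
    d[ii] = 2 * in[2*ii + 1] - *s_tmp0 - in[2*ii + 2];
    s[ii] = *s_tmp0 + shr3(*d_tmp0 + d[ii]);
    *d_tmp0 = d[ii];
    *s_tmp0 = s[ii];
}

static void rolled(const int *in, int *s, int *d, int d_tmp0, int s_tmp0)
{
    for (int ii = 1; ii < MID; ii++)
        body(in, s, d, ii, &d_tmp0, &s_tmp0);
}

static void unrolled_by_two(const int *in, int *s, int *d, int d_tmp0, int s_tmp0)
{
    /* MID - 1 = 30 iterations, an even count, so no remainder loop is needed. */
    for (int ii = 1; ii < MID; ii = ii + 2) {
        body(in, s, d, ii,     &d_tmp0, &s_tmp0);
        body(in, s, d, ii + 1, &d_tmp0, &s_tmp0);
    }
}

/* Returns 1 if both versions produce identical s[] and d[] for a test input. */
static int unrolling_preserves_result(void)
{
    int in[2*MID + 2];
    int s1[MID + 1] = {0}, d1[MID + 1] = {0};
    int s2[MID + 1] = {0}, d2[MID + 1] = {0};
    for (int k = 0; k < 2*MID + 2; k++)
        in[k] = (k * 37) % 101 - 50;          /* arbitrary mixed-sign data */
    rolled(in, s1, d1, 3, -4);
    unrolled_by_two(in, s2, d2, 3, -4);
    for (int k = 1; k < MID; k++)
        if (s1[k] != s2[k] || d1[k] != d2[k])
            return 0;
    return 1;
}
```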

[...] the other loops, only the [...]. To obtain [...] for [...]

Each input and output data will occupy an IRAM [...]

A Method for Compiling High-Level Language Programs to a Reconfigurable Data-Flow Processor

1 Introduction 

This document describes a method for compiling a subset of a high-level programming language (HLL)
like C or FORTRAN, extended by port access functions, to a reconfigurable data-flow processor (RDFP)
as described in Section 3. The program is transformed to a configuration of the RDFP.

This method can be used as part of an extended compiler for a hybrid architecture consisting of a standard
host processor and a reconfigurable data-flow coprocessor. The extended compiler handles a full HLL like
standard ANSI C. It maps suitable program parts like inner loops to the coprocessor and the rest
of the program to the host processor. It is also possible to map separate program parts to separate
configurations. However, these extensions are not subject of this document.

2 Compilation Flow 

This section briefly describes the phases of the compilation method. 

2.1 Frontend 

The compiler uses a standard frontend which translates the input program (e.g. a C program) into an
internal format consisting of an abstract syntax tree (AST) and symbol tables. The SUIF compiler
[2] is an example of a compiler providing such a frontend.

2.2 Control/Dataflow Graph Generation 

Next, the program is mapped to a control/dataflow graph (CDFG). This is the main subject of this document.

2.3 Configuration Code Generation

3 Configurable Objects and Functionality of a RDFP

This section describes the configurable objects and functionality of a RDFP. A possible implementation [...].
The data types considered are multi-bit data words and single-bit control signals.

3.1 Configurable Objects and Functions

An RDFP consists of an array of configurable objects and a communication network. Each object can 
be configured to perform certain functions (listed below). It performs the same function repeatedly until 
the configuration is changed. The array need not be completely uniform, i.e. not all objects need to be
able to perform all functions. E.g., a RAM function can be implemented by a specialized RAM object
which cannot perform any other functions. It is also possible to combine several objects to a "macro" to
realize certain functions. Several RAM objects can, e.g., be combined to realize a RAM function with
larger storage.



[Figure: symbols of the RDFP functions. Legible labels include the ALU inputs A and B with its opcode, the counter inputs LB, UB and INC, 1-CONSTANT, INVERTER, ESEQ and ECOMB.]

Figure 1: Functions of an RDFP 

The following functions for processing data and event packets can be configured into an RDFP. See Fig. 1
for a graphical representation.

• ALU[opcode]: ALUs perform common arithmetical and logical operations on data. ALU functions
("opcodes") must be available for all operations used in the HLL.¹ ALU functions have two
data inputs A and B, and one data output X. Comparators have an event output U instead of the
data output. They produce a 1-event if the comparison is true, and a 0-event otherwise.

¹ Otherwise, programs containing operations which do not have ALU opcodes in the RDFP must be excluded from the
supported HLL subset or substituted by "macros" of existing functions.

• CNT: A counter function which has data inputs LB, UB and INC (lower bound, upper bound
and increment) and data output X (counter value). A packet at event input START starts the
counter, and event input NEXT causes the generation of the next output value (and output events)
or causes the counter to terminate if UB is reached. If NEXT is not connected, the counter counts
continuously. The output events U, V, and W have the following functionality: For a counter
counting N times, N-1 0-events and one 1-event are generated at output U. At output V, N 0-events
are generated, and at output W, N 0-events and one 1-event are created. The 1-event at W is only
created after the counter has terminated, i.e. a NEXT event packet was received after the last data
packet was output.

• RAM[size]: The RAM function stores a fixed number of data words ("size"). It has a data input
RD and a data output OUT for reading at address RD. Event output ERD signals completion of
the read access. For a write access, data inputs WR and IN (address and value) and data output
OUT are used. Event output EWR signals completion of the write access. ERD and EWR always
generate 0-events. Note that external RAM can be handled as RAM functions exactly like internal
RAM.

• GATE: A GATE synchronizes a data packet at input A and an event packet at input E. When
both inputs have arrived, they are both consumed. The data packet is copied to output X, and the
event packet to output U.

• MUX: A MUX function has 2 data inputs A and B, an event input SEL, and a data output X. If
SEL receives a 0-event, input A is copied to output X and input B discarded. For a 1-event, B is
copied and A discarded.

• MERGE: A MERGE function has 2 data inputs A and B, an event input SEL, and a data output X.
If SEL receives a 0-event, input A is copied to output X, but input B is not discarded. The packet
is left at the input B instead. For a 1-event, B is copied and A left at the input.

• DEMUX: A DEMUX function has one data input A, an event input SEL, and two data outputs X
and Y. If SEL receives a 0-event, input A is copied to output X, and no packet is created at output
Y. For a 1-event, A is copied to Y, and no packet is created at output X.

• MDATA: A MDATA function multiplicates data packets. It has a data input A, an event input
SEL, and a data output X. If SEL receives a 1-event, a data packet at A is consumed and copied
to output X. For all subsequent 0-events at SEL, a copy of the input data packet is produced at the
output without consuming new packets at A. Only if another 1-event arrives at SEL, the next data
packet at A is consumed and copied.²

• INPORT[name]: Receives data packets from outside the RDFP through input port "name" and
copies them to data output X. If a packet was received, a 0-event is produced at event output U,
too. (Note that this function can only be configured at special objects connected to external busses.)

• OUTPORT[name]: Sends data packets received at data input A to the outside of the RDFP through
output port "name". If a packet was sent, a 0-event is produced at event output U, too. (Note that
this function can only be configured at special objects connected to external busses.)
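The difference between MUX, MERGE and DEMUX lies in which input packets are consumed. The sketch below models each port as a single slot that is either full or empty. This is an assumed simplification for illustration; struct port and the function names are hypothetical and not part of any RDFP tool.

```c
#include <assert.h>
#include <stdbool.h>

/* A single-slot port: holds at most one pending data packet. */
struct port { bool full; int value; };

/* MUX: the selected input is copied to *x, the other input is discarded. */
static void mux(struct port *a, struct port *b, int sel, int *x)
{
    assert(a->full && b->full);
    *x = sel ? b->value : a->value;
    a->full = false;
    b->full = false;
}

/* MERGE: only the selected input is consumed; the other packet stays. */
static void merge(struct port *a, struct port *b, int sel, int *x)
{
    struct port *chosen = sel ? b : a;
    assert(chosen->full);
    *x = chosen->value;
    chosen->full = false;
}

/* DEMUX: A goes to X on a 0-event and to Y on a 1-event. */
static void demux(struct port *a, int sel, struct port *x, struct port *y)
{
    struct port *target = sel ? y : x;
    assert(a->full && !target->full);
    target->value = a->value;
    target->full = true;
    a->full = false;
}
```

After a merge with SEL = 0, the packet at B is still pending and can be selected by a later 1-event, which is the property the loop schemes in Section 4 rely on.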

Additionally, the following functions manipulate only event packets: 
² Note that this can be implemented by a MERGE with special properties on XPP™.

An RDFP requires register delays in the dataflow. Otherwise very long combinational delays and
asynchronous feedback are possible. We assume that delays are inserted at the inputs of some functions (like
for most ALUs) and in some routing segments of the communication network. Note that registers change
the timing, but not the functionality of a correct CDFG.

4 Configuration Generation 

4.1 Language Definition 

The following HLL features are not supported by the method described here: 

• pointer operations 

• library calls, operating system calls (including standard I/O functions) 

• recursive function calls (Note that non-recursive function calls can be eliminated by function
inlining and therefore are not considered here.)

All scalar data types are converted to type integer. Integer values are equivalent to data packets in
the RDFP. Arrays (possibly multi-dimensional) are the only composite data types considered.

The following additional features are supported:

INPORTS and OUTPORTS can be accessed by the HLL functions getstream(name, value) and
putstream(name, value) respectively.

4.2 Mapping of High-Level Language Constructs 

This method converts a HLL program to a CDFG consisting of the RDFP functions defined in Section 3.1.
Before the processing starts, all HLL program arrays are mapped to RDFP RAM functions. An array x
is mapped to RAM RAM(x). If several arrays are mapped to the same RAM, an offset is assigned, too.
The RAMs are added to an initially empty CDFG. There must be enough RAMs of sufficient size for all
program arrays.
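The text does not prescribe how arrays are distributed over RAMs. One simple possibility is a first-fit placement that assigns each array a RAM and an offset, sketched below. RAM_SIZE, MAX_RAMS and all names are assumptions chosen for illustration.

```c
#include <assert.h>

#define RAM_SIZE 512            /* assumed capacity of one RAM function, in words */
#define MAX_RAMS 4              /* assumed number of available RAM functions */

struct placement { int ram; int offset; };

static int ram_used[MAX_RAMS];  /* words already allocated in each RAM */

/* First-fit placement of an array of `words` words; ram = -1 if nothing fits. */
static struct placement place_array(int words)
{
    struct placement p = { -1, 0 };
    assert(words > 0);
    for (int r = 0; r < MAX_RAMS; r++) {
        if (ram_used[r] + words <= RAM_SIZE) {
            p.ram = r;
            p.offset = ram_used[r];   /* offset of the array inside RAM r */
            ram_used[r] += words;
            break;
        }
    }
    return p;
}
```

With these sizes, two 256-word arrays such as s and d share RAM 0 at offsets 0 and 256, and a result of ram = -1 signals that there are not enough RAMs for the program.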

The CDFG is generated by a traversal of the AST of the HLL program. It processes the program
statement by statement and descends into the loops and conditional statements as appropriate. The following
two pieces of information are updated at every program point⁵ during the traversal:

• START points to an event output of a RDFP function. This output delivers a 0-event whenever
the program execution reaches this program point. At the beginning, a 0-CONSTANT preloaded
with an event input is added to the CDFG. (It delivers a 0-event immediately after configuration.)
START initially points to its output. This event is used to start the overall program execution. The
START_new signal generated after a program part has finished executing is used as new START
signal for the following program parts, or it signals termination of the entire program. The START
events guarantee that the execution order of the original program is maintained wherever the data
dependencies alone are not sufficient. This scheduling scheme is similar to a one-hot controller
for digital hardware.

⁵ In a program, program points are between two statements or before the beginning or after the end of a program component
like a loop or a conditional statement.

• VARLIST is a list of {variable, function-output} pairs. The pairs map integer variables or array
elements to a CDFG function's output. The first pair for a variable in VARLIST contains the
output of the function which produces the value of this variable valid at the current program point.
New pairs are always added to the front of VARLIST. The expression VARDEF(var) refers to the
function-output of the first pair with variable var in VARLIST.⁶

The following subsections systematically list all HLL program components and describe how they are 
processed, thereby altering the CDFG, START and VARLIST. 
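A minimal sketch of VARLIST as a singly linked list is given below, assuming that a function output can be represented by a plain integer id. The struct layout and the names varlist_add and vardef are hypothetical; only the front-insertion rule and the first-match lookup come from the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One {variable, function-output} pair. The output is an opaque id here. */
struct pair {
    const char *var;
    int output;            /* hypothetical id of a CDFG function output or constant */
    struct pair *next;
};

/* New pairs always go to the front of the list. */
static struct pair *varlist_add(struct pair *list, const char *var, int output)
{
    struct pair *p = malloc(sizeof *p);
    assert(p != NULL);
    p->var = var;
    p->output = output;
    p->next = list;
    return p;
}

/* VARDEF(var): output of the first pair for var, or -1 if var is undefined. */
static int vardef(const struct pair *list, const char *var)
{
    for (; list != NULL; list = list->next)
        if (strcmp(list->var, var) == 0)
            return list->output;
    return -1;
}
```

Because lookups stop at the first match, older pairs for the same variable can stay in the list, as in the VARLIST examples of Section 4.2.2.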



4.2.1 Integer Expressions and Assignments

Straight-line code without array accesses can be directly mapped to a data-flow graph. One ALU is 
allocated for each operator in the program. Because of the self-synchronization of the ALUs, no explicit 
control or scheduling is needed. Therefore processing these assignments does not access or alter START. 
The data dependences (as they would be exposed in the DAG representation of the program [1]) are 
analyzed through the processing of VARLIST. These assignments synchronize themselves through the 
data-flow. The data-driven execution automatically exploits the available instruction level parallelism. 

All assignments evaluate the right-hand side (RHS) or source expression. This evaluation results in a
pointer to a CDFG object's output (or pseudo-object as defined below). For integer assignments, the
left-hand side (LHS) variable or destination is combined with the RHS result object to form a new pair
{LHS, result(RHS)} which is added to the front of VARLIST.

The simplest statement is a constant assigned to an integer:⁷

a = 5;

It doesn't change the CDFG, but adds {a, 5} to the front of VARLIST. The constant 5 is a
"pseudo-object" which only holds the value, but does not refer to a CDFG object. Now VARDEF(a) equals 5 at
subsequent program points before a is redefined.

Integer assignments can also combine variables already defined and constants:

b = a * 2 + 3;

In the AST, the RHS is already converted to an expression tree. This tree is transformed to a combination
of old and new CDFG objects (which are added to the CDFG) as follows: Each operator (internal node)
of the tree is substituted by an ALU with the opcode corresponding to the operator in the tree. If a leaf
node is a constant, the ALU's input is directly connected to that constant. If a leaf node is an integer
variable var, it is looked up in VARLIST, i.e. VARDEF(var) is retrieved. Then VARDEF(var) (an output
of an already existing object in CDFG or a constant) is connected to the ALU's input. The output of the
ALU corresponding to the root operator in the expression tree is defined as the result of the RHS. Finally,
a new pair {LHS, result(RHS)} is added to VARLIST. If the two assignments above are processed, the
CDFG with two ALUs in Fig. 2 is created.⁸ Outputs occurring in VARLIST are labeled by Roman
numbers. After these two assignments, VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left
side.) Note that all inputs connected to a constant (whether direct from the expression tree or retrieved
from VARLIST) must be defined as constant. Inputs defined as constants have a small c next to the input
arrow in Fig. 2.

⁶ This method of using a VARLIST is adapted from the Transmogrifier C compiler [5].
⁷ Note that we use C syntax for the following examples.
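The mapping of the two assignments can be mimicked with a small node structure: one ALU node per operator and constant pseudo-objects at the leaves. This is a sketch for illustration only. The node layout and names are hypothetical, and recursive evaluation merely stands in for the data-driven firing of the ALUs.

```c
#include <assert.h>

enum kind { CONSTANT, ALU_MUL, ALU_ADD };

/* A CDFG object: a constant pseudo-object or an ALU with inputs A and B. */
struct node {
    enum kind kind;
    int value;                   /* used by CONSTANT only */
    const struct node *a, *b;    /* used by ALUs only */
};

/* An ALU produces its output once both of its inputs are available. */
static int eval(const struct node *n)
{
    switch (n->kind) {
    case CONSTANT: return n->value;
    case ALU_MUL:  return eval(n->a) * eval(n->b);
    case ALU_ADD:  return eval(n->a) + eval(n->b);
    }
    assert(0 && "unknown node kind");
    return 0;
}

/* CDFG for "a = 5; b = a * 2 + 3;": two ALUs, all leaves are constants. */
static int example_b(void)
{
    const struct node a     = { CONSTANT, 5, 0, 0 };          /* VARDEF(a) */
    const struct node two   = { CONSTANT, 2, 0, 0 };
    const struct node three = { CONSTANT, 3, 0, 0 };
    const struct node mul   = { ALU_MUL, 0, &a, &two };
    const struct node add   = { ALU_ADD, 0, &mul, &three };   /* output I = VARDEF(b) */
    return eval(&add);
}
```

The result agrees with the value that constant propagation in a frontend would compute for b.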

4.2.2 Conditional Integer Assignments 

For conditional if-then-else statements containing only integer assignments, objects for condition
evaluation are created first. The object's event output indicating the condition result is kept for choosing
the correct branch result later. Next, both branches are processed in parallel, using separate copies
VARLIST1 and VARLIST2 of VARLIST. (VARLIST itself is not changed.) Finally, for all variables
added to VARLIST1 or VARLIST2, a new entry for VARLIST is created (combination phase). The valid
definitions from VARLIST1 and VARLIST2 are combined with a MUX function, and the correct input
is selected by the condition result. For variables only defined in one of the two branches, the multiplexer
uses the result retrieved from the original VARLIST for the other branch. If the original VARLIST does
not have an entry for this variable, a special "undefined" constant value is used. However, in a
functionally correct program this value will never be used. As an optimization, only variables live [1] after the
if-then-else structure need to be added to VARLIST in the combination phase.⁹

Consider the following example: 

i = 7;
a = 3;

if (i < 10) {
   a = 5;
   c = 7;
}
else {
   c = a - 1;
   d = 0;
}

Fig. 3 shows the resulting CDFG. Before the if-then-else construct, VARLIST = [{a, 3}, {i, 7}]. After
processing the branches, for the then branch, VARLIST1 = [{c, 7}, {a, 5}, {a, 3}, {i, 7}], and for the
else branch, VARLIST2 = [{d, 0}, {c, I}, {a, 3}, {i, 7}]. After combination, VARLIST = [{d, II}, {c,
III}, {a, IV}, {a, 3}, {i, 7}].
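The combination phase for this example can be sketched in C by evaluating both branches unconditionally and selecting each result with a MUX. The code assumes the MUX convention of Section 3.1 (0-event selects A, 1-event selects B) and that the comparator emits a 1-event when the condition holds. The value 0 used for "undefined" and all names are placeholders.

```c
#include <assert.h>

/* MUX semantics: a 0-event selects input A, a 1-event selects input B. */
static int mux(int sel, int a, int b) { return sel ? b : a; }

struct vars { int a, c, d; };

/* Reference: the example program executed with ordinary control flow. */
static struct vars with_branches(int i)
{
    struct vars v = { 3, 0, 0 };       /* c and d start undefined; 0 is a placeholder */
    if (i < 10) { v.a = 5; v.c = 7; }
    else        { v.c = v.a - 1; v.d = 0; }
    return v;
}

/* Data-flow form: both branches are evaluated, MUXes pick the valid definition. */
static struct vars with_muxes(int i)
{
    const int a0 = 3, undefined = 0;
    int cond = (i < 10);               /* event output of the comparator */
    int a_then = 5, c_then = 7;        /* definitions from VARLIST1 */
    int c_else = a0 - 1, d_else = 0;   /* definitions from VARLIST2 */
    struct vars v;
    v.a = mux(cond, a0, a_then);       /* else side falls back to the original VARLIST */
    v.c = mux(cond, c_else, c_then);
    v.d = mux(cond, d_else, undefined);
    return v;
}
```

Both functions return the same values for any i, which is the property that lets the scheme work without explicit control.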

Note that case- or switch-statements can be processed, too, since they can - without loss of generality - 
be converted to nested if-then-else statements. 

Processing conditional statements this way does not require explicit control and does not change START.
Both branches are executed in parallel and synchronized by the data-flow. It is possible to pipeline the
dataflow for optimal throughput.

⁸ Note that the input and output names can be deduced from their position, cf. Fig. 1. Also note that the compiler
frontend would normally have substituted the second assignment by b = 13 (constant propagation). For the simplicity of this
explanation, no frontend optimizations are considered in this and the following examples.

⁹ Definition: A variable is live at a program point if its value is read at a statement reachable from here without intermediate
redefinition.

4.2.3 General Conditional Statements

Conditional statements containing either array accesses (cf. Section 4.2.7 below) or inner loops cannot
be processed as described in Section 4.2.2. Data packets must only be sent to the active branch. This is
achieved by the implementation shown in Fig. 8, similar to the method presented in [4].

A dataflow analysis is performed to compute used sets use and defined sets def [1] of both branches.¹⁰
For the current VARLIST entries of all variables in IN = use(thenbody) ∪ def(thenbody) ∪
use(elsebody) ∪ def(elsebody) ∪ use(header), DEMUX functions controlled by the IF condition are
inserted. Note that arrows with double lines in Fig. 8 denote connections for all variables in IN, and the
shaded DEMUX function stands for several DEMUX functions, one for each variable in IN. The
DEMUX functions forward data packets only to the selected branch. New lists VARLIST1 and VARLIST2
are compiled with the respective outputs of these DEMUX functions. The then-branch is processed with
VARLIST1, and the else branch with VARLIST2. Finally, the output values are combined. OUT
contains the new values for the same variables as in IN. Since only one branch is ever activated, there will not
be a conflict due to two packets arriving simultaneously. The combinations will be added to VARLIST
after the conditional statement. If the IF execution shall be pipelined, MERGE opcodes for the output
must be inserted, too. They are controlled by the condition like the DEMUX functions.
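Computing IN is a plain set union. With variables numbered and sets stored as bit masks, a sketch could look as follows. The varset type, the variable numbering and the function names are assumptions for illustration; the example sets are those of the if-then-else example in Section 4.2.2.

```c
#include <assert.h>

typedef unsigned int varset;     /* bit k set <=> variable number k is in the set */

enum { VAR_I = 1u << 0, VAR_A = 1u << 1, VAR_C = 1u << 2, VAR_D = 1u << 3 };

/* IN = use(thenbody) + def(thenbody) + use(elsebody) + def(elsebody) + use(header),
   where + denotes set union. */
static varset in_set(varset use_then, varset def_then,
                     varset use_else, varset def_else, varset use_header)
{
    return use_then | def_then | use_else | def_else | use_header;
}

/* Number of DEMUX functions needed: one per variable in IN. */
static int demux_count(varset in)
{
    int n = 0;
    for (; in != 0; in &= in - 1)    /* clear the lowest set bit */
        n++;
    return n;
}
```

For that example the then-branch defines a and c, the else-branch uses a and defines c and d, and the header uses i, so IN contains four variables and four DEMUX functions are needed.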



The following extension with respect to [4] is added (dotted lines in Fig. 8) in order to control the
execution as mentioned above with START events: The START input is ECOMB-combined with the condition
output and connected to the SEL input of the DEMUX functions. The START inputs of thenbody and
elsebody are generated from the ECOMB output sent through a 1-FILTER and a 0-CONSTANT¹¹ or
through a 0-FILTER, respectively. The overall START_new output is generated by a simple "2 to 1
connection" of thenbody's and elsebody's START_new outputs. With this extension, arbitrarily nested
conditional statements or loops can be handled within thenbody and elsebody.

4.2.4 WHILE Loops

WHILE loops are processed similarly to the scheme presented in [4], cf. Fig. 9. As in Section 4.2.3,
double line connections and shaded MERGE and DEMUX functions represent duplication for all variables
in IN. Here IN = use(whilebody) ∪ def(whilebody) ∪ use(header). The WHILE loop executes as
follows: In the first loop iteration, the MERGE functions select all input values from VARLIST at loop
entry (SEL=0). The MERGE outputs are connected to the header and the DEMUX functions. If the
while condition is true (SEL=1), the input values are forwarded to the whilebody, otherwise to OUT.
The output values of the while body are fed back to whilebody's input via the MERGE and DEMUX
operators as long as the condition is true. Finally, after the last iteration, they are forwarded to OUT. The
outputs are added to the new VARLIST.¹²
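The token flow of a single variable through this scheme can be traced with a short sequential simulation. The sketch below does so for the loop "while (v < limit) v = v + step;". It models only the MERGE select, the header comparison and the DEMUX routing described above, not the START extensions, and the function name and parameters are hypothetical.

```c
#include <assert.h>

/* Simulates the token flow of one variable v through the WHILE scheme for
   "while (v < limit) v = v + step;". Returns the value forwarded to OUT and
   stores the number of whilebody executions in *iterations. */
static int while_scheme(int v_in, int limit, int step, int *iterations)
{
    int sel = 0;            /* MERGE select: 0 = value from loop entry, 1 = feedback */
    int feedback = 0;       /* packet on the feedback path */
    *iterations = 0;
    assert(step > 0);
    for (;;) {
        int merged = sel ? feedback : v_in;      /* MERGE */
        int cond = merged < limit;               /* header: 1-event if true */
        if (!cond)                               /* DEMUX, SEL=0: forward to OUT */
            return merged;
        feedback = merged + step;                /* DEMUX, SEL=1: into whilebody */
        (*iterations)++;
        sel = 1;                                 /* later iterations take the feedback */
    }
}
```

If the condition is already false at loop entry, the input value is forwarded to OUT unchanged and the body never executes.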

\ Two extensions with respect to [4] are added (dotted lines in Fir- 9): 

10 A variable is used in a statement (and hence in a program region containing this statement) if its value is read. A variable
is defined in a statement (or region) if a new value is assigned to it.

11 The 0-CONSTANT is required since START events must always be 0-events.

12 Note that the MERGE function for variables not live at the loop's beginning and the whilebody's beginning can be removed
since its output is not used. For these variables, only the DEMUX function to output the final value is required. Also note that
the MERGE functions can be replaced by simple "2 to 1 connections" if the configuration process guarantees that packets from
IN1 always arrive at the DEMUX's input before feedback values arrive.

• In [4], the SEL input of the MERGE functions is preloaded with 0. Hence the loop execution
begins immediately and can be executed only once. Instead, we connect the START input to the
MERGE's SEL input ("2 to 1 connection" with the header output). This makes it possible to control
the time at which the loop execution starts and to restart it.

• The whilebody's START input is connected to the header output, sent through a 1-FILTER/0-
CONSTANT combination as above (generates a 0-event for each loop iteration). By ECOMB-
combining whilebody's STARTnew output with the header output for the MERGE functions'
SEL inputs, the next loop iteration is only started after the previous one has finished. The while
loop's STARTnew output is generated by filtering the header output for a 0-event.

With these extensions, arbitrarily nested conditional statements or loops can be handled within
whilebody.
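
The data path of this scheme can be illustrated in software. The following is a minimal sketch, not the
compiler's output: it assumes a hypothetical loop "while (i < n) { s = s + i; i = i + 1; }" with
IN = {i, s}, and the names vars_t, merge, header, whilebody and run_while are introduced here purely
for illustration. The START/STARTnew event handling is left out.

```c
#include <assert.h>

/* Hypothetical packet holding the loop-carried variables in IN (here: i and s). */
typedef struct { int i; int s; } vars_t;

/* MERGE: SEL=0 selects the value from VARLIST (loop entry), SEL=1 the feedback. */
static vars_t merge(int sel, vars_t entry, vars_t feedback) {
    assert(sel == 0 || sel == 1);   /* events only carry the values 0 and 1 */
    return sel ? feedback : entry;
}

/* header: evaluates the while condition (i < n) and returns it as a 0/1 event. */
static int header(vars_t v, int n) { return v.i < n; }

/* whilebody: one iteration of { s = s + i; i = i + 1; }. */
static vars_t whilebody(vars_t v) {
    vars_t r = { v.i + 1, v.s + v.i };
    return r;
}

/* Runs the scheme: the DEMUX forwards to whilebody on SEL=1 and to OUT on SEL=0. */
vars_t run_while(vars_t entry, int n) {
    vars_t feedback = entry;   /* placeholder until the first feedback value exists */
    int sel = 0;               /* the first iteration takes the VARLIST input */
    for (;;) {
        vars_t v = merge(sel, entry, feedback);
        if (!header(v, n)) return v;   /* DEMUX with SEL=0: forward to OUT */
        feedback = whilebody(v);       /* DEMUX with SEL=1: forward to whilebody */
        sel = 1;                       /* later iterations take the feedback value */
    }
}
```

Calling run_while with entry values i = 0, s = 0 and n = 5 returns the packet that the DEMUX forwards to
OUT after the condition becomes false.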

4.2.5 FOR Loops 

FOR loops are particularly regular WHILE loops. Therefore we could handle them as explained above. 
However, our RDFP features the special counter function CNT and the data packet multiplication func- 
tion MDATA which can be used for a more efficient implementation of FOR loops. This new FOR loop 
scheme is shown in Fig. 10. 

A FOR loop is controlled by a counter CNT. The lower bound (LB), upper bound (UB), and increment 
(INC) expressions are evaluated like any other expressions (see Sections 4.2.1 and 4.2.7) and connected 
to the respective inputs. 

As opposed to WHILE loops, a MERGE/DEMUX combination is only required for variables in IN1 =
def(forbody), i. e. those defined in forbody. 13 IN1 does not contain variables which are only used
in forbody, LB, UB, or INC, and does also not contain the loop index variable. Variables in IN1 are
processed as in WHILE loops, but the MERGE and DEMUX functions' SEL input is connected to
CNT's W output. (The W output does the inverse of a WHILE loop's header output; it outputs a
1-event after the counter has terminated. Therefore the inputs of the MERGE functions and the outputs
of the DEMUX functions are swapped here, and the MERGE functions' SEL inputs are preloaded with
1-events.)

CNT's X output provides the current value of the loop index variable. If the final index value is required
(live) after the FOR loop, it is selected with a DEMUX function controlled by CNT's U event output
(which produces one event for every loop iteration).
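
The behavior of the X and U outputs can be sketched in software. This is a minimal model under stated
assumptions, not a description of the hardware: the names cnt_out_t, cnt_run and final_index are
hypothetical, a positive increment is assumed, and the U event is modeled as a 1-event only in the last
iteration (as stated further below for pipelined loops). The V, W, START and NEXT connections are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* One step of the hypothetical counter model: index value X and event U. */
typedef struct { int x; int u; } cnt_out_t;

/* Fills out[] with one entry per iteration of for (x = lb; x <= ub; x += inc).
   Returns the number of entries written (at most cap). U is 1 only for the
   iteration after which the counter would terminate. */
size_t cnt_run(int lb, int ub, int inc, cnt_out_t *out, size_t cap) {
    assert(inc > 0);   /* assumption of this sketch: counting upwards only */
    size_t n = 0;
    for (int x = lb; x <= ub && n < cap; x += inc) {
        out[n].x = x;
        out[n].u = (x + inc > ub);
        n++;
    }
    return n;
}

/* DEMUX controlled by U: forwards the X packet only when U is a 1-event.
   Returns 1 and stores the forwarded value if there was one, otherwise 0. */
int final_index(const cnt_out_t *out, size_t n, int *final) {
    for (size_t k = 0; k < n; k++) {
        if (out[k].u) { *final = out[k].x; return 1; }
    }
    return 0;
}
```

In this model the DEMUX forwards the X packet that coincides with the single 1-event on U; all other X
packets are discarded.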

Variables in IN2 = use(forbody) \ def(forbody), i. e. those defined outside the loop and only used
(but not redefined) inside the loop, are handled differently. Unless it is a constant value, the variable's
input value (from VARLIST) must be reproduced in each loop iteration since it is consumed in each
iteration. Otherwise the loop would stall from the second iteration onwards. The packets are reproduced
by MDATA functions, with the SEL inputs connected to CNT's U output. The SEL inputs must be
preloaded with a 1-event to select the first input. The 1-event provided by the last iteration selects a new
value for the next execution of the entire loop.

13 Note that the MERGE functions can be replaced by simple "2 to 1 connections" as for WHILE loops if the configuration
process guarantees that packets from IN1 always arrive at the DEMUX's input before feedback values arrive.

The following control events (dotted lines in Fig. 10) are similar to the WHILE loop extensions but
simpler. CNT's START input is connected to the loop's overall START signal. STARTnew is generated
from CNT's W output, sent through a 1-FILTER and 0-CONSTANT. CNT's V output produces one 0-
event for each loop iteration and is therefore used as forbody's START. Finally, CNT's NEXT input is
connected to forbody's STARTnew output.

For pipelined loops (as defined below in Section 4.2.6), loop iterations are allowed to overlap. Therefore
CNT's NEXT input need not be connected. Now the counter produces index variable values and control
events as fast as they can be consumed. However, in this case CNT's W output is not sufficient as overall
STARTnew output since the counter terminates before the last iteration's forbody finishes. Instead
STARTnew is generated from CNT's U output ECOMB-combined with forbody's STARTnew output,
sent through a 1-FILTER/0-CONSTANT combination. The ECOMB produces an event after termination
of each loop iteration, but only the last event is a 1-event because only the last output of CNT's U output
is a 1-event. Hence this event indicates that the last iteration has finished. Cf. Section 4.3 for a FOR loop
example compilation with and without pipelining.

As for WHILE loops, these methods allow arbitrarily nested loops and conditional statements to be
processed. The following advantages over WHILE loop implementations are achieved:

• One index variable value is generated by the CNT function each clock cycle. This is faster and
smaller than the WHILE loop implementation which allocates a MERGE/DEMUX/ADD loop and
a comparator for the counter functionality.

• Variables in IN2 (only used in forbody) are reproduced in the special MDATA functions and need
not go through a MERGE/DEMUX loop. This is again faster and smaller than the WHILE loop
implementation.

4.2.6 Vectorization and Pipelining

The method described so far generates CDFGs performing the HLL program's functionality on an RDFP.
However, the program execution is unduly sequentialized by the START signals. In some cases, innermost
loops can be vectorized. This means that loop iterations can overlap, leading to a pipelined dataflow
through the operators of the loop body. The Pipeline Vectorization technique [6] can be easily applied to
the compilation method presented here. As mentioned above, for FOR loops, the CNT's NEXT input is
removed so that CNT counts continuously, thereby overlapping the loop iterations.

All loops without array accesses can be pipelined since the dataflow automatically synchronizes
loop-carried dependences, i. e. dependences between a statement in one iteration and another statement in a
subsequent iteration. Loops with array accesses can be pipelined if the array (i. e. RAM) accesses do
not cause loop-carried dependences or can be transformed to such a form. In this case no RAM address
is written in one and read in a subsequent iteration. Therefore the read and write accesses to the same
RAM may overlap. This degree of freedom is exploited in the RAM access technique described below.
Especially for dual-ported RAM it leads to considerable performance improvements.

4.2.7 Array Accesses 

In contrast to scalar variables, array accesses have to be controlled explicitly in order to maintain the
program's correct execution order. As opposed to normal dataflow machine models [3], an RDFP does
not have a single address space. Instead, the arrays are allocated to several RAMs. This leads to a
different approach to handling RAM accesses and opens up new opportunities for optimization.

To reduce the complexity of the compilation process, array accesses are processed in two phases. Phase
1 uses "pseudo-functions" for RAM read and write accesses. A RAM read function has a RD data input
(read address) and an OUT data output (read value), and a RAM write function has WR and IN data
inputs (write address and write value). Both functions are labeled with the array the access refers to, and
both have a START event input and a U event output. The events control the access order. In Phase 2 all
accesses to the same RAM are combined and substituted by a single RAM function as shown in Fig. 1.
This involves manipulating the data and event inputs and outputs such that the correct execution order is
maintained and the outputs are forwarded to the correct part of the CDFG.

Phase 1 Since arrays are allocated to several RAMs, only accesses to the same RAM have to be
synchronized. Accesses to different RAMs can occur concurrently or even out of order. In case of data
dependencies, the accesses self-synchronize automatically. Within pipelined loops, not even read and
write accesses to the same RAM have to be synchronized. This is achieved by maintaining separate
START signals for every RAM or even separate START signals for RAM read and RAM write accesses
in pipelined loops. At the end of a basic block [1] 14, all STARTnew outputs must be combined by an
ECOMB to provide a START signal for the next basic block which guarantees that all array accesses in
the previous basic block are completed. For pipelined loops, this condition can even be relaxed. Only
after the loop exit do all accesses have to be completed. The individual loop iterations need not be
synchronized.

First the RAM addresses are computed. The compiler frontend's standard transformation for array
accesses can be used, and a CDFG function's output is generated which provides the address. If applicable,
the offset with respect to the RDFP RAM (as determined in the initial mapping phase) must be added.
This output is connected to the pseudo RAM read's RD input (for a read access) or to the pseudo RAM
write's WR input (for a write access). Additionally, the OUT output (read) or IN input (write) is
connected. The START input is connected to the variable's START signal, and the U output is used as
STARTnew for the next access.

To avoid redundant read accesses, RAM reads are also registered in VARLIST. Instead of an integer
variable, an array element is used as first element of the pair. However, a change in a variable occurring
in an array index invalidates the information in VARLIST. It must then be removed from it.
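
The bookkeeping described here can be sketched as follows. This is only an illustration under
assumptions: the types and functions (read_entry_t, varlist_t, varlist_register, varlist_lookup,
varlist_assign) are hypothetical, array indices are reduced to a single index variable name, and CDFG
outputs are represented by integer ids. Only the invalidation rule stated above is modeled.

```c
#include <assert.h>
#include <string.h>

#define VARLIST_CAP 16

/* Hypothetical VARLIST entry for a registered RAM read such as a[i]:
   the array element (array, index variable) paired with a CDFG output id. */
typedef struct {
    const char *array;
    const char *index_var;
    int output_id;
    int valid;
} read_entry_t;

typedef struct { read_entry_t e[VARLIST_CAP]; int n; } varlist_t;

/* Registers the output that carries the value read from array[index_var]. */
void varlist_register(varlist_t *vl, const char *array,
                      const char *index_var, int output_id) {
    assert(vl->n < VARLIST_CAP);
    read_entry_t r = { array, index_var, output_id, 1 };
    vl->e[vl->n++] = r;
}

/* Returns the registered output id, or -1 if the read must be performed. */
int varlist_lookup(const varlist_t *vl, const char *array, const char *index_var) {
    for (int k = 0; k < vl->n; k++) {
        const read_entry_t *r = &vl->e[k];
        if (r->valid && strcmp(r->array, array) == 0 &&
            strcmp(r->index_var, index_var) == 0)
            return r->output_id;
    }
    return -1;
}

/* A change to a variable invalidates every entry whose index uses it. */
void varlist_assign(varlist_t *vl, const char *var) {
    for (int k = 0; k < vl->n; k++)
        if (strcmp(vl->e[k].index_var, var) == 0)
            vl->e[k].valid = 0;
}
```

A second read of a[i] then reuses the registered output as long as i has not been reassigned in between.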

The following example with two read accesses compiles to the intermediate CDFG shown in Fig. 12. The
START signals refer only to variable a. STOP1 is the event connection which synchronizes the accesses.
Inputs START (old), i and j should be substituted by the actual outputs resulting from the program before
the array reads.

x = a[i];
y = a[j];
z = x + y;

Fig. 13 shows the translation of the following write access:

a[i] = x;

14 A basic block is a program part with a single entry and a single exit point, i. e. a piece of straight-line code.
Phase 2 We now merge the pseudo-functions of all accesses to the same RAM and substitute them by
a single RAM function. For all data inputs (RD for read access and WR and IN for write access) GATEs
are inserted between the input and the RAM function. Their E inputs are connected to the respective
START inputs of the original pseudo-functions. If a RAM is read and written at only one program point,
the U output of the read and write access is moved to the ERD or EWR output, respectively. For example,
the single access a[i] = x; from Fig. 13 is transformed to the final CDFG shown in Fig. 5.

However, if several read or several write accesses (i. e. pseudo-functions from different program points)
to the same RAM occur, the ERD or EWR events are not specific anymore. But a STARTnew event of
the original pseudo-function should only be generated for the respective program point, i. e. for the current
access. This is achieved by connecting the START signals of all other accesses (pseudo-functions)
of the same type (read or write) with the inverted START signal of the current access. The resulting
signal is ECOMB-combined with the ERD or EWR output. The ECOMB's output will only occur after
the access is completed. Because ECOMB OR-combines its event packets, only the current access leads [...]
to a STARTnew signal which produces a 0-event only after the current access is completed, as required.
[...] Several sources are connected to the RD, WR and IN inputs of a RAM. This disables
[...] since only one [...]

Similarly, the data packets at the OUT output face the same problem as the ERD event packets:
They occur for every read access, but must only be used (and forwarded to subsequent operators) for
the current access. This can be achieved by connecting the OUT output via a DEMUX function. The Y
output of the DEMUX is used, and the X output is left unconnected. Then it acts as a selective gate which
only forwards packets if its SEL input receives a 1-event, and discards its data input if it receives a
0-event. The signal created by the ECOMB described above for the STARTnew signal creates a 1-event
for the current access, and a 0-event otherwise. Using it as the SEL input achieves the desired behavior.

[...] "2 to 1 connected" [...] START(old) is inverted, [...] because it is the START input of the second
read pseudo-function. [...] output, led through the 1-FILTER/0-CONSTANT combination. START(new) is
generated similarly, but here START(old) is directly used and STOP1 inverted. The [...]
functions for outputs x and y are connected to the ECOMB outputs related to STOP1 and START(new).
Multiple write accesses use the same control events, but instead of one GATE per access for [...]

First, continuous sequences of either read accesses or write accesses (not mixed) within a basic block [...]
another pseudo-function of the same RAM and the same type (read or write). For these sequences it is
possible to stream data into the RAM rather than waiting for the previous access to complete. For this
purpose, a combination of MERGE functions selects the RD or WR and IN inputs in the order given
by the sequence. The MERGEs must be controlled by iterative ESEQs guaranteeing that the inputs are
only forwarded in the desired order. Then only the first access in the sequence needs to be controlled by
a GATE or GATEs. Similarly, the OUT outputs of a read access can be distributed more efficiently for
a sequence. A combination of DEMUX functions with the same ESEQ control can be used. It is most
efficient to arrange the MERGE and DEMUX functions as balanced binary trees.

The STARTnew signal is generated as follows: For a sequence of length n, the START signal of the
entire sequence is replicated n times by an ESEQ[00..1] function with the START input connected to
the sequence's START. Its output is directly "N to 1 connected" with the other accesses' START signal
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for access sequences), ECOMB-
connected to EWR or ERD, respectively, and sent through a 1-FILTER/0-CONSTANT combination,
similar to the basic method described above. Since only the last ESEQ output is a 1-event, only the
last RAM access generates a STARTnew as required. Alternatively, for read accesses, the generation
of the last output can be sent through a GATE (without the E input connected), thereby producing a
STARTnew event.
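
The effect of the ESEQ, ECOMB and 1-FILTER/0-CONSTANT chain on a single access sequence can be
sketched in software. The following is a minimal model under assumptions, not the actual function
library: the names eseq_event, ecomb, filter1_const0 and start_new_events are hypothetical, the RAM's
ERD/EWR events are assumed to be 0-events, and the "N to 1 connection" with other accesses is left out.

```c
#include <assert.h>
#include <stddef.h>

/* ESEQ[00..1]: of n replicated events, only the last one is a 1-event. */
int eseq_event(size_t k, size_t n) {
    assert(k < n);
    return k == n - 1;
}

/* ECOMB: takes one event from each input and OR-combines them. */
int ecomb(int a, int b) { return a | b; }

/* 1-FILTER followed by 0-CONSTANT: a 1-event becomes a 0-event, a 0-event
   is dropped. Returns 1 if an event was emitted and stores it in *out. */
int filter1_const0(int ev, int *out) {
    if (!ev) return 0;
    *out = 0;
    return 1;
}

/* Counts the STARTnew events produced for a sequence of n accesses and
   stores in *last_at the index of the access that produced the last one.
   Assumption of this sketch: every ERD/EWR event is a 0-event. */
size_t start_new_events(size_t n, size_t *last_at) {
    size_t count = 0;
    for (size_t k = 0; k < n; k++) {
        int erd = 0;
        int ev;
        if (filter1_const0(ecomb(eseq_event(k, n), erd), &ev)) {
            assert(ev == 0);   /* START events are always 0-events */
            count++;
            *last_at = k;
        }
    }
    return count;
}
```

Under these assumptions a sequence of n accesses yields exactly one STARTnew event, produced by the
last access.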

Fig. [...] shows the optimized version of the first example (Figures 12 and 4) using the ESEQ-method for
array reads. Here the latter method for producing the STARTnew event is used.

x = a[i];
y = a[j];
z = a[k];

If [...] of read sequences and single read accesses occur for the same RAM, 1-events
[...] access must be generated for sequences of read accesses. [...] needed
[...] a 1-CONSTANT, achieves this. It is again "N to 1 connected" to the other accesses' START signals
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for access sequences). [...]
described in the basic method. Refer to the second example (Figures
15 and 16) in Section 4.3 for a complete example.

4.2.8 Input and Output Ports 

Input and output ports are processed similarly to vector accesses. A read from an input port is like an
[...] subsequent operators. The STOP signal is generated in the same way as described above for
RAM accesses by combining the INPORT's U output with the current and other START signals.
[...] for RAM accesses [...]
4.3 More Examples

Fig. 7 shows the generated CDFG for the following for loop.

a = b + c;
for (i=0; i<=10; i++) {
    a = a + i;
    [...] = k;
}

In this example, IN1 = {a} and IN2 = {k} (cf. Fig. 10). The MERGE function for variable a is
replaced by a 2:1 data connection as mentioned in the footnote of Section 4.2.5. Note that only one
data packet arrives for variables b, c and k, and one final packet is produced for a (out). forbody does
not use a START event since both operations (the adder and the RAM write) are dataflow-controlled
by the counter. But the RAM's EWR output is the forbody's STARTnew and connected to
CNT's NEXT input. Note that the pipelining optimization, cf. Section 4.2.6, was not applied here. If it
is applied (which is possible for this loop), CNT's NEXT input is not connected, cf. Fig. 11. Here, the
loop iterations overlap. STARTnew is generated from CNT's U output and forbody's STARTnew (i. e.
RAM's EWR output), as defined at the end of Section 4.2.5.

The following program contains a vectorizable (pipelined) loop with one write access to array (RAM) x
and a sequence of two read accesses to array (RAM) y. After the loop, another single read access to y
occurs.

z = 0;
for (i=0; i<=10; i++) {
    x[i] = i;
    z = z + y[i] + y[2*i];
}
a = y[k];
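
For checking a compiled CDFG against the source semantics, the program can be wrapped as a plain
sequential C function. The following is a sketch with assumed details: the function name
example_reference, the array sizes N_ITER and Y_LEN, and passing k and the result a as parameters are
choices made here for illustration and do not appear in the original example.

```c
#include <assert.h>

#define N_ITER 11   /* i runs from 0 to 10 */
#define Y_LEN  21   /* y[2*i] reaches index 20 */

/* Sequential reference for the example: returns z and stores a in *a_out. */
int example_reference(int x[N_ITER], const int y[Y_LEN], int k, int *a_out) {
    assert(k >= 0 && k < Y_LEN);   /* the final read must stay inside y */
    int z = 0;
    for (int i = 0; i <= 10; i++) {
        x[i] = i;                      /* write access to RAM x */
        z = z + y[i] + y[2 * i];       /* sequence of two read accesses to RAM y */
    }
    *a_out = y[k];                     /* single read access after the loop */
    return z;
}
```

Since the loop writes only x and reads only y, no RAM address is written in one iteration and read in a
later one, which is the condition under which the loop can be pipelined.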

Fig. 15 shows the intermediate CDFG generated before the array access Phase 2 transformation is
applied. The pipelined loop is controlled as follows: Within the loop, separate START signals for write
accesses to x and read accesses to y are used. The reentry to the forbody is also controlled by two
independent signals ("cycle1" and "cycle2"). For the read accesses, "cycle2" [...] read y
[...] loop exit all accesses must be finished, which is guaranteed by signal "loop
finished". The single read access is completely independent of the loop.

Fig. 16 shows the final CDFG after Phase 2. Note that "cycle1" is removed since a single write access
needs no additional control, and "cycle2" is removed since the inserted MERGE and DEMUX functions
[...] the correct execution order. The read y accesses are not independent anymore
since they all refer to the same RAM, and the functions have been merged. ESEQs have been allocated
to control the MERGE and DEMUX functions of the read sequence, and for the first-stage DEMUX
functions which separate the read OUT values for the read sequence and for the final single read access.
The ECOMBs, 1-FILTERs, 0-CONSTANTs and 1-CONSTANTs are allocated as described in Section 4.2.7,
Phase 2, to generate correct control events for the GATEs and DEMUX functions.